{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "5747e926",
   "metadata": {},
   "source": [
    "<a href=\"https://colab.research.google.com/github/run-llama/llama_index/blob/main/docs/examples/data_connectors/WebPageDemo.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "30146ad2-f165-4f4b-ae07-fe6597a2964f",
   "metadata": {},
   "source": [
    "# Web Page Reader\n",
    "\n",
    "This notebook demonstrates LlamaIndex's web page readers, including `SimpleWebPageReader` and `SpiderWebReader`.\n",
    "\n",
    "If you're opening this notebook on Colab, you will probably need to install LlamaIndex 🦙."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9f9959b5",
   "metadata": {},
   "outputs": [],
   "source": [
    "%pip install llama-index llama-index-readers-web"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3c39063b",
   "metadata": {},
   "outputs": [],
   "source": [
    "import logging\n",
    "import sys\n",
    "\n",
    "# basicConfig already attaches a stdout handler to the root logger;\n",
    "# adding a second StreamHandler would print every log message twice.\n",
    "logging.basicConfig(stream=sys.stdout, level=logging.INFO)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2315a154-f72d-4447-b1eb-cde9b66868cb",
   "metadata": {},
   "source": [
    "#### Using SimpleWebPageReader"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "87bf7ecd-50cd-47da-9f0e-bc48d7ae45d8",
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index.core import SummaryIndex\n",
    "from llama_index.readers.web import SimpleWebPageReader\n",
    "from IPython.display import Markdown, display\n",
    "import os"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b6de3929-51eb-4064-b4b6-c203bb6debc4",
   "metadata": {},
   "outputs": [],
   "source": [
    "# NOTE: the html_to_text=True option requires html2text to be installed"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "663403de-2e6e-4340-ab8f-8ee681bc06aa",
   "metadata": {},
   "outputs": [],
   "source": [
    "documents = SimpleWebPageReader(html_to_text=True).load_data(\n",
    "    [\"http://paulgraham.com/worked.html\"]\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b8cd183a-2423-4a3e-ad92-dfe89ed5454e",
   "metadata": {},
   "outputs": [],
   "source": [
    "documents[0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "26854cc3-af61-4910-ab6b-3bed6acfb447",
   "metadata": {},
   "outputs": [],
   "source": [
    "index = SummaryIndex.from_documents(documents)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5cfdf87a-97cb-481f-ad51-be5bf8b5217f",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Set the logging level to DEBUG for more detailed output\n",
    "query_engine = index.as_query_engine()\n",
    "response = query_engine.query(\"What did the author do growing up?\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7278d033-cae3-4ddf-96bd-75ea570ca53f",
   "metadata": {},
   "outputs": [],
   "source": [
    "display(Markdown(f\"<b>{response}</b>\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6e7b0a56",
   "metadata": {},
   "source": [
    "#### Using Spider Reader 🕷\n",
    "[Spider](https://spider.cloud/?ref=llama_index) is a web crawler that its maintainers benchmark as the [fastest](https://github.com/spider-rs/spider/blob/main/benches/BENCHMARKS.md#benchmark-results) available. It converts any website into pure HTML, markdown, metadata, or text, and lets you crawl with custom actions using AI.\n",
    "\n",
    "Spider also offers high-performance proxies to help prevent detection, caching of AI actions, webhooks for crawl status, scheduled crawls, and more.\n",
    "\n",
    "**Prerequisites:** You need a Spider API key to use this loader. You can get one at [spider.cloud](https://spider.cloud)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "bdfb59f7",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[Document(id_='54a6ecf3-b33e-41e9-8cec-48657aa2ed9b', embedding=None, metadata={'description': 'Collect data rapidly from any website. Seamlessly scrape websites and get data tailored for LLM workloads.', 'domain': 'spider.cloud', 'extracted_data': None, 'file_size': 101750, 'keywords': None, 'pathname': '/', 'resource_type': 'html', 'title': 'Spider - Fastest Web Crawler', 'url': '48f1bc3c-3fbb-408a-865b-c191a1bb1f48/spider.cloud/index.html', 'user_id': '48f1bc3c-3fbb-408a-865b-c191a1bb1f48'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text='Spider - Fastest Web Crawler[Spider v1 Logo Spider ](/)[Pricing](/credits/new)[GitHubGithub637](https://github.com/spider-rs/spider)The World\\'s Fastest and Cheapest Crawler API==========View Demo* Basic* StreamingExample requestPythonCopy```import requests, osheaders = {    \\'Authorization\\': os.environ[\"SPIDER_API_KEY\"],    \\'Content-Type\\': \\'application/json\\',}json_data = {\"limit\":50,\"url\":\"http://www.example.com\"}response = requests.post(\\'https://api.spider.cloud/crawl\\',  headers=headers,  json=json_data)print(response.json())```Example ResponseUnmatched Speed----------### 5secs  ###To crawl 200 pages### 21x  ###Faster than FireCrawl### 150x  ###Faster than Apify Benchmarks displaying performance between Spider Cloud, Firecrawl, and Apify.[See framework benchmarks ](https://github.com/spider-rs/spider/blob/main/benches/BENCHMARKS.md)Foundations for Crawling Effectively----------### Leading in performance ###Spider is written in Rust and runs in full concurrency to achieve crawling dozens of pages in secs.### Optimal response format ###Get clean and formatted markdown, HTML, or text content for fine-tuning or training AI models.### Caching ###Further boost speed by caching repeated web page crawls.### Smart Mode ###Spider dynamically switches to Headless Chrome when it needs to.Beta### Scrape with AI ###Do custom browser scripting and data extraction using 
the latest AI models.### Best crawler for LLMs ###Don\\'t let crawling and scraping be the highest latency in your LLM & AI agent stack.### Scrape with no headaches ###* Proxy rotations* Agent headers* Avoid anti-bot detections* Headless chrome* Markdown LLM Responses### The Fastest Web Crawler ###* Powered by [spider-rs](https://github.com/spider-rs/spider)* Do 20,000 pages in seconds* Full concurrency* Powerful and simple API* 5,000 requests per minute### Do more with AI ###* Custom browser scripting* Advanced data extraction* Data pipelines* Perfect for LLM and AI Agents* Accurate website labeling[API](/docs/api) [Pricing](/credits/new) [Guides](/guides) [About](/about) [Docs](https://docs.rs/spider/latest/spider/) [Privacy](/privacy) [Terms](/eula)© 2024 Spider from A11yWatchTheme Light Dark Toggle Theme [GitHubGithub](https://github.com/spider-rs/spider)', start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\\n\\n{content}', metadata_template='{key}: {value}', metadata_seperator='\\n')]\n"
     ]
    }
   ],
   "source": [
    "# Scrape single URL\n",
    "from llama_index.readers.web import SpiderWebReader\n",
    "\n",
    "spider_reader = SpiderWebReader(\n",
    "    api_key=\"YOUR_API_KEY\",  # Get one at https://spider.cloud\n",
    "    mode=\"scrape\",\n",
    "    # params={} # Optional parameters see more on https://spider.cloud/docs/api\n",
    ")\n",
    "\n",
    "documents = spider_reader.load_data(url=\"https://spider.cloud\")\n",
    "print(documents)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "780b794e",
   "metadata": {},
   "source": [
    "Crawl a domain, following links to all of its subpages:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "80c10c79",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[Document(id_='63f7ccbf-c6c8-4f69-80f7-f6763f761a39', embedding=None, metadata={'description': 'Our privacy policy and how it plays a part in the data collected.', 'domain': 'spider.cloud', 'extracted_data': None, 'file_size': 26647, 'keywords': None, 'pathname': '/privacy', 'resource_type': 'html', 'title': 'Privacy', 'url': '48f1bc3c-3fbb-408a-865b-c191a1bb1f48/spider.cloud/privacy.html', 'user_id': '48f1bc3c-3fbb-408a-865b-c191a1bb1f48'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text=\"Privacy[Spider v1 Logo Spider ](/) [Credits](/credits/new)[GitHubGithub637](https://github.com/spider-rs/spider)Privacy Policy==========Learn about how we take privacy with the Spider project.[Spider](https://spider.cloud) offers a cutting-edge data scraping service with powerful AI capabilities. Our data collecting platform is designed to help users maximize the benefits of data collection while embracing the advancements in AI technology. With our innovative tools, we provide a seamless and fast interactive experience. This privacy policy details Spider's approach to product development, deployment, and usage, encompassing the Crawler, AI products, and features.[AI Development at Spider----------](#ai-development-at-spider)Spider leverages a robust combination of proprietary code, open-source frameworks, and synthetic datasets to train its cutting-edge products. To continuously improve our offerings, Spider may utilize inputs from user-generated prompts and content, obtained from trusted third-party providers. By harnessing this diverse data, Spider can deliver highly precise and pertinent recommendations to our valued users. While the foundational data crawling aspect of Spider is openly available on Github, the dashboard and AI components remain closed source. 
Spider respects all robots.txt files declared on websites allowing data to be extracted without harming the website.[Security, Privacy, and Trust----------](#security-privacy-and-trust)At Spider, our utmost priority is the development and implementation of Crawlers, AI technologies, and products that adhere to ethical, moral, and legal standards. We are dedicated to creating a secure and respectful environment for all users. Safeguarding user data and ensuring transparency in its usage are core principles we uphold. In line with this commitment, we provide the following important disclosures when utilizing our AI-related products:* Spider ensures comprehensive disclosure of features that utilize third-party AI platforms. To provide clarity, these integrations will be clearly indicated through distinct markers, designations, explanatory notes that appear when hovering, references to the underlying codebase, or any other suitable form of notification as determined by the system. Our commitment to transparency aims to keep users informed about the involvement of third-party AI platforms in our products.* We collect and use personal data as set forth in our [Privacy Policy](https://spider.cloud/privacy) which governs the collection and usage of personal data. If you choose to input personal data into our AI products, please be aware that such information may be processed through third-party AI providers. For any inquiries or concerns regarding data privacy, feel free to reach out to us at [Spider Help Github](https://github.com/orgs/spider-rs/discussions). 
We are here to assist you.* Except for user-generated prompts and/or content as inputs, Spider does not use customer data, including the code related to the use of Spider's deployment services, to train or finetune any models used.* We periodically review and update our policies and procedures in an effort to comply with applicable data protection regulations and industry standards.* We use reasonable measures designed to maintain the safety of users and avoid harm to people and the environment. Spider's design and development process includes considerations for ethical, security, and regulatory requirements with certain safeguards to prevent and report misuse or abuse.[Third-Party Service Providers----------](#third-party-service-providers)In providing AI products and services, we leverage various third-party providers in the AI space to enhance our services and capabilities, and will continue to do so for certain product features.This page will be updated from time to time with information about Spider's use of AI. The current list of third-party AI providers integrated into Spider is as follows:* [Anthropic](https://console.anthropic.com/legal/terms)* [Azure Cognitive Services](https://learn.microsoft.com/en-us/legal/cognitive-services/openai/data-privacy)* [Cohere](https://cohere.com/terms-of-use)* [ElevenLabs](https://elevenlabs.io/terms)* [Hugging Face](https://huggingface.co/terms-of-service)* [Meta AI](https://www.facebook.com/policies_center/)* [OpenAI](https://openai.com/policies)* [Pinecone](https://www.pinecone.io/terms)* [Replicate](https://replicate.com/terms)We prioritize the safety of our users and take appropriate measures to avoid harm both to individuals and the environment. Our design and development processes incorporate considerations for ethical practices, security protocols, and regulatory requirements, along with established safeguards to prevent and report any instances of misuse or abuse. 
We are committed to maintaining a secure and respectful environment and upholding responsible practices throughout our services.[Acceptable Use----------](#acceptable-use)Spider's products are intended to provide helpful and respectful responses to user prompts and queries while collecting data along the web. We don't allow the use of our Scraper or AI tools, products and services for the following usages:* Denial of Service Attacks* Illegal activity* Inauthentic, deceptive, or impersonation behavior* Any other use that would violate Spider's standard published policies, codes of conduct, or terms of service.Any violation of this Spider AI Policy or any Spider policies or terms of service may result in termination of use of services at Spider's sole discretion. We will review and update this Spider AI Policy so that it remains relevant and effective. If you have feedback or would like to report any concerns or issues related to the use of AI systems, please reach out to [support@spider.cloud](mailto:support@spider.cloud).[More Information----------](#more-information)To learn more about Spider's integration of AI capabilities into products and features, check out the following resources:* [Spider-Rust](https://github.com/spider-rs)* [Spider](/)* [About](/)[API](/docs/api) [Pricing](/credits/new) [Guides](/guides) [About](/about) [Docs](https://docs.rs/spider/latest/spider/) [Privacy](/privacy) [Terms](/eula)© 2024 Spider from A11yWatchTheme Light Dark Toggle Theme [GitHubGithub](https://github.com/spider-rs/spider)\", start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\\n\\n{content}', metadata_template='{key}: {value}', metadata_seperator='\\n'), Document(id_='18e4d35d-ff48-4d00-b924-abab7a06fbec', embedding=None, metadata={'description': 'Learn how to crawl and scrape websites with the fastest web crawler built for the job.', 'domain': 'spider.cloud', 'extracted_data': None, 'file_size': 27058, 'keywords': None, 'pathname': '/guides', 
'resource_type': 'html', 'title': 'Spider Guides', 'url': '48f1bc3c-3fbb-408a-865b-c191a1bb1f48/spider.cloud/guides.html', 'user_id': '48f1bc3c-3fbb-408a-865b-c191a1bb1f48'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text='Spider Guides[Spider v1 Logo Spider ](/) [Credits](/credits/new)[GitHubGithub637](https://github.com/spider-rs/spider)Spider Guides==========Learn how to crawl and scrape websites easily.(4) Total Guides* [  Spider v1 Logo  Spider Platform  ----------  How to use the platform to collect data from the internet fast, affordable, and unblockable.  ](/guides/spider)* [  Spider v1 Logo  Spider API  ----------  How to use the Spider API to curate data from any source blazing fast. The most advanced crawler that handles all workloads of all sizes.  ](/guides/spider-api)* [  Spider v1 Logo  Extract Contacts  ----------  Get contact information from any website in real time with AI. The only way to accurately get dynamic information from websites.  ](/guides/pipelines-extract-contacts)* [  Spider v1 Logo  Website Archiving  ----------  The programmable time machine that can store pages and all assets for easy website archiving.  ](/guides/website-archiving)[API](/docs/api) [Pricing](/credits/new) [Guides](/guides) [About](/about) [Docs](https://docs.rs/spider/latest/spider/) [Privacy](/privacy) [Terms](/eula)© 2024 Spider from A11yWatchTheme Light Dark Toggle Theme [GitHubGithub](https://github.com/spider-rs/spider)', start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\\n\\n{content}', metadata_template='{key}: {value}', metadata_seperator='\\n'), Document(id_='b10c6402-bc35-4fec-b97c-fa30bde54ce8', embedding=None, metadata={'description': 'Complete reference documentation for the Spider API. 
Includes code snippets and examples for quickly getting started with the system.', 'domain': 'spider.cloud', 'extracted_data': None, 'file_size': 195426, 'keywords': None, 'pathname': '/docs/api', 'resource_type': 'html', 'title': 'Spider API Reference', 'url': '48f1bc3c-3fbb-408a-865b-c191a1bb1f48/spider.cloud/docs*_*api.html', 'user_id': '48f1bc3c-3fbb-408a-865b-c191a1bb1f48'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text='Spider API Reference[Spider v1 Logo Spider ](/) [Credits](/credits/new)[GitHubGithub637](https://github.com/spider-rs/spider)API Reference==========The Spider API is based on REST. Our API is predictable, returns [JSON-encoded](http://www.json.org/) responses, uses standard HTTP response codes, authentication, and verbs. Set your API secret key in the `authorization` header to commence. You can use the `content-type` header with `application/json`, `application/xml`, `text/csv`, and `application/jsonl` for shaping the response.The Spider API supports multi domain actions. You can work with multiple domains per request by adding the urls comma separated.The Spider API differs for every account as we release new versions and tailor functionality. You can add `v1` before any path to pin to the version.Just getting started?----------Check out our [development quickstart](/guides/spider-api) guide.Not a developer?----------Use Spiders [no-code options or apps](/guides/spider) to get started with Spider and to do more with your Spider account no code required.Base UrlJSONCopy```https://api.spider.cloud```Crawl websites==========Start crawling a website(s) to collect resources.POST https://api.spider.cloud/crawlRequest body* url\\xa0required\\xa0string  ----------  The URI resource to crawl. This can be a comma split list for multiple urls.  Test Url* request\\xa0string  ----------  The request type to perform. Possible values are `http`, `chrome`, and `smart`. 
Use `smart` to perform HTTP request by default until JavaScript rendering is needed for the HTML.  HTTP* limit\\xa0number  ----------  The maximum amount of pages allowed to crawl per website. Remove the value or set it to 0 to crawl all pages.  Crawl Limit* depth\\xa0number  ----------  The crawl limit for maximum depth. If zero, no limit will be applied.  Crawl DepthSet Example* cache\\xa0boolean  ----------  Use HTTP caching for the crawl to speed up repeated runs.  Set Example* budget\\xa0object  ----------  Object that has paths with a counter for limiting the amount of pages example `{\"*\":1}` for only crawling the root page. The wildcard matches all routes and you can set child paths preventing a depth level, example of limiting `{ \"/docs/colors\": 10, \"/docs/\": 100 }` which only allows a max of 100 pages if the route matches `/docs/:pathname` and only 10 pages if it matches `/docs/colors/:pathname`.  Crawl Budget  Set Example* locale\\xa0string  ----------  The locale to use for request, example `en-US`.  Set Example* cookies\\xa0string  ----------  Add HTTP cookies to use for request.  Set Example* stealth\\xa0boolean  ----------  Use stealth mode for headless chrome request to help prevent being blocked. The default is enabled on chrome.  Set Example* headers\\xa0string  ----------  Forward HTTP headers to use for all request. The object is expected to be a map of key value pairs.  Set Example* metadata\\xa0boolean  ----------  Boolean to store metadata about the pages and content found. This could help improve AI interopt. Defaults to false unless you have the website already stored with the configuration enabled.  Set Example* viewport\\xa0object  ----------  Configure the viewport for chrome. Defaults to 800x600.  Set Example* encoding\\xa0string  ----------  The type of encoding to use like `UTF-8`, `SHIFT_JIS`, or etc.  Set Example* subdomains\\xa0boolean  ----------  Allow subdomains to be included.  
Set Example* user\\\\_agent\\xa0string  ----------  Add a custom HTTP user agent to the request.  Set Example* store\\\\_data\\xa0boolean  ----------  Boolean to determine if storage should be used. If set this takes precedence over `storageless`. Defaults to false.  Set Example* gpt\\\\_config\\xa0object  ----------  Use AI to generate actions to perform during the crawl. You can pass an array for the`\"prompt\"` to chain steps.  Set Example* fingerprint\\xa0boolean  ----------  Use advanced fingerprint for chrome.  Set Example* storageless\\xa0boolean  ----------  Boolean to prevent storing any type of data for the request including storage and AI vectors embedding. Defaults to false unless you have the website already stored.  Set Example* readability\\xa0boolean  ----------  Use [readability](https://github.com/mozilla/readability) to pre-process the content for reading. This may drastically improve the content for LLM usage.  Set Example* return\\\\_format\\xa0string  ----------  The format to return the data in. Possible values are `markdown`, `raw`, `text`, and `html2text`. Use `raw` to return the default format of the page like `HTML` etc.  Raw* proxy\\\\_enabled\\xa0boolean  ----------  Enable high performance premium proxies for the request to prevent being blocked at the network level.  Set Example* query\\\\_selector\\xa0string  ----------  The CSS query selector to use when extracting content from the markup.  Test Query Selector* full\\\\_resources\\xa0boolean  ----------  Crawl and download all the resources for a website.  Set Example* request\\\\_timeout\\xa0number  ----------  The timeout to use for request. Timeouts can be from 5-60. The default is 30 seconds.  Set Example* run\\\\_in\\\\_background\\xa0boolean  ----------  Run the request in the background. Useful if storing data and wanting to trigger crawls to the dashboard. This has no effect if storageless is set.  
Set ExampleShow More Properties* Basic* StreamingExample requestPythonCopy```import requests, osheaders = {    \\'Authorization\\': os.environ[\"SPIDER_API_KEY\"],    \\'Content-Type\\': \\'application/json\\',}json_data = {\"limit\":50,\"url\":\"http://www.example.com\"}response = requests.post(\\'https://api.spider.cloud/crawl\\',  headers=headers,  json=json_data)print(response.json())```ResponseCopy```[  {    \"content\": \"<html>...\",    \"error\": null,    \"status\": 200,    \"url\": \"http://www.example.com\"  },  // more content...]```Crawl websites get links==========Start crawling a website(s) to collect links found.POST https://api.spider.cloud/linksRequest body* url\\xa0required\\xa0string  ----------  The URI resource to crawl. This can be a comma split list for multiple urls.  Test Url* request\\xa0string  ----------  The request type to perform. Possible values are `http`, `chrome`, and `smart`. Use `smart` to perform HTTP request by default until JavaScript rendering is needed for the HTML.  HTTP* limit\\xa0number  ----------  The maximum amount of pages allowed to crawl per website. Remove the value or set it to 0 to crawl all pages.  Crawl Limit* depth\\xa0number  ----------  The crawl limit for maximum depth. If zero, no limit will be applied.  Crawl DepthSet Example* cache\\xa0boolean  ----------  Use HTTP caching for the crawl to speed up repeated runs.  Set Example* budget\\xa0object  ----------  Object that has paths with a counter for limiting the amount of pages example `{\"*\":1}` for only crawling the root page. The wildcard matches all routes and you can set child paths preventing a depth level, example of limiting `{ \"/docs/colors\": 10, \"/docs/\": 100 }` which only allows a max of 100 pages if the route matches `/docs/:pathname` and only 10 pages if it matches `/docs/colors/:pathname`.  Crawl Budget  Set Example* locale\\xa0string  ----------  The locale to use for request, example `en-US`.  
Set Example* cookies\\xa0string  ----------  Add HTTP cookies to use for request.  Set Example* stealth\\xa0boolean  ----------  Use stealth mode for headless chrome request to help prevent being blocked. The default is enabled on chrome.  Set Example* headers\\xa0string  ----------  Forward HTTP headers to use for all request. The object is expected to be a map of key value pairs.  Set Example* metadata\\xa0boolean  ----------  Boolean to store metadata about the pages and content found. This could help improve AI interopt. Defaults to false unless you have the website already stored with the configuration enabled.  Set Example* viewport\\xa0object  ----------  Configure the viewport for chrome. Defaults to 800x600.  Set Example* encoding\\xa0string  ----------  The type of encoding to use like `UTF-8`, `SHIFT_JIS`, or etc.  Set Example* subdomains\\xa0boolean  ----------  Allow subdomains to be included.  Set Example* user\\\\_agent\\xa0string  ----------  Add a custom HTTP user agent to the request.  Set Example* store\\\\_data\\xa0boolean  ----------  Boolean to determine if storage should be used. If set this takes precedence over `storageless`. Defaults to false.  Set Example* gpt\\\\_config\\xa0object  ----------  Use AI to generate actions to perform during the crawl. You can pass an array for the`\"prompt\"` to chain steps.  Set Example* fingerprint\\xa0boolean  ----------  Use advanced fingerprint for chrome.  Set Example* storageless\\xa0boolean  ----------  Boolean to prevent storing any type of data for the request including storage and AI vectors embedding. Defaults to false unless you have the website already stored.  Set Example* readability\\xa0boolean  ----------  Use [readability](https://github.com/mozilla/readability) to pre-process the content for reading. This may drastically improve the content for LLM usage.  Set Example* return\\\\_format\\xa0string  ----------  The format to return the data in. 
Possible values are `markdown`, `raw`, `text`, and `html2text`. Use `raw` to return the default format of the page like `HTML` etc.  Raw* proxy\\\\_enabled\\xa0boolean  ----------  Enable high performance premium proxies for the request to prevent being blocked at the network level.  Set Example* query\\\\_selector\\xa0string  ----------  The CSS query selector to use when extracting content from the markup.  Test Query Selector* full\\\\_resources\\xa0boolean  ----------  Crawl and download all the resources for a website.  Set Example* request\\\\_timeout\\xa0number  ----------  The timeout to use for request. Timeouts can be from 5-60. The default is 30 seconds.  Set Example* run\\\\_in\\\\_background\\xa0boolean  ----------  Run the request in the background. Useful if storing data and wanting to trigger crawls to the dashboard. This has no effect if storageless is set.  Set ExampleShow More Properties* Basic* StreamingExample requestPythonCopy```import requests, osheaders = {    \\'Authorization\\': os.environ[\"SPIDER_API_KEY\"],    \\'Content-Type\\': \\'application/json\\',}json_data = {\"limit\":50,\"url\":\"http://www.example.com\"}response = requests.post(\\'https://api.spider.cloud/links\\',  headers=headers,  json=json_data)print(response.json())```ResponseCopy```[  {    \"content\": \"\",    \"error\": null,    \"status\": 200,    \"url\": \"http://www.example.com\"  },  // more content...]```Screenshot websites==========Start taking screenshots of website(s) to collect images to base64 or binary.POST https://api.spider.cloud/screenshotRequest bodyGeneralSpecific* url\\xa0required\\xa0string  ----------  The URI resource to crawl. This can be a comma split list for multiple urls.  Test Url* request\\xa0string  ----------  The request type to perform. Possible values are `http`, `chrome`, and `smart`. Use `smart` to perform HTTP request by default until JavaScript rendering is needed for the HTML.  
HTTP* limit\\xa0number  ----------  The maximum amount of pages allowed to crawl per website. Remove the value or set it to 0 to crawl all pages.  Crawl Limit* depth\\xa0number  ----------  The crawl limit for maximum depth. If zero, no limit will be applied.  Crawl DepthSet Example* cache\\xa0boolean  ----------  Use HTTP caching for the crawl to speed up repeated runs.  Set Example* budget\\xa0object  ----------  Object that has paths with a counter for limiting the amount of pages example `{\"*\":1}` for only crawling the root page. The wildcard matches all routes and you can set child paths preventing a depth level, example of limiting `{ \"/docs/colors\": 10, \"/docs/\": 100 }` which only allows a max of 100 pages if the route matches `/docs/:pathname` and only 10 pages if it matches `/docs/colors/:pathname`.  Crawl Budget  Set Example* locale\\xa0string  ----------  The locale to use for request, example `en-US`.  Set Example* cookies\\xa0string  ----------  Add HTTP cookies to use for request.  Set Example* stealth\\xa0boolean  ----------  Use stealth mode for headless chrome request to help prevent being blocked. The default is enabled on chrome.  Set Example* headers\\xa0string  ----------  Forward HTTP headers to use for all request. The object is expected to be a map of key value pairs.  Set Example* metadata\\xa0boolean  ----------  Boolean to store metadata about the pages and content found. This could help improve AI interopt. Defaults to false unless you have the website already stored with the configuration enabled.  Set Example* viewport\\xa0object  ----------  Configure the viewport for chrome. Defaults to 800x600.  Set Example* encoding\\xa0string  ----------  The type of encoding to use like `UTF-8`, `SHIFT_JIS`, or etc.  Set Example* subdomains\\xa0boolean  ----------  Allow subdomains to be included.  Set Example* user\\\\_agent\\xa0string  ----------  Add a custom HTTP user agent to the request.  
Set Example* store\\\\_data\\xa0boolean  ----------  Boolean to determine if storage should be used. If set this takes precedence over `storageless`. Defaults to false.  Set Example* gpt\\\\_config\\xa0object  ----------  Use AI to generate actions to perform during the crawl. You can pass an array for the`\"prompt\"` to chain steps.  Set Example* fingerprint\\xa0boolean  ----------  Use advanced fingerprint for chrome.  Set Example* storageless\\xa0boolean  ----------  Boolean to prevent storing any type of data for the request including storage and AI vectors embedding. Defaults to false unless you have the website already stored.  Set Example* readability\\xa0boolean  ----------  Use [readability](https://github.com/mozilla/readability) to pre-process the content for reading. This may drastically improve the content for LLM usage.  Set Example* return\\\\_format\\xa0string  ----------  The format to return the data in. Possible values are `markdown`, `raw`, `text`, and `html2text`. Use `raw` to return the default format of the page like `HTML` etc.  Raw* proxy\\\\_enabled\\xa0boolean  ----------  Enable high performance premium proxies for the request to prevent being blocked at the network level.  Set Example* query\\\\_selector\\xa0string  ----------  The CSS query selector to use when extracting content from the markup.  Test Query Selector* full\\\\_resources\\xa0boolean  ----------  Crawl and download all the resources for a website.  Set Example* request\\\\_timeout\\xa0number  ----------  The timeout to use for request. Timeouts can be from 5-60. The default is 30 seconds.  Set Example* run\\\\_in\\\\_background\\xa0boolean  ----------  Run the request in the background. Useful if storing data and wanting to trigger crawls to the dashboard. This has no effect if storageless is set.  
Set ExampleShow More Properties* Basic* StreamingExample requestPythonCopy```import requests, osheaders = {    \\'Authorization\\': os.environ[\"SPIDER_API_KEY\"],    \\'Content-Type\\': \\'application/json\\',}json_data = {\"limit\":50,\"url\":\"http://www.example.com\"}response = requests.post(\\'https://api.spider.cloud/screenshot\\',  headers=headers,  json=json_data)print(response.json())```ResponseCopy```[  {    \"content\": \"base64...\",    \"error\": null,    \"status\": 200,    \"url\": \"http://www.example.com\"  },  // more content...]```Pipelines----------Create powerful workflows with our pipeline API endpoints. Use AI to extract contacts from any website or filter links with prompts with ease.Crawl websites and extract contacts==========Start crawling a website(s) to collect all contacts found leveraging AI.POST https://api.spider.cloud/pipeline/extract-contactsRequest bodyGeneralSpecific* url\\xa0required\\xa0string  ----------  The URI resource to crawl. This can be a comma split list for multiple urls.  Test Url* request\\xa0string  ----------  The request type to perform. Possible values are `http`, `chrome`, and `smart`. Use `smart` to perform HTTP request by default until JavaScript rendering is needed for the HTML.  HTTP* limit\\xa0number  ----------  The maximum amount of pages allowed to crawl per website. Remove the value or set it to 0 to crawl all pages.  Crawl Limit* depth\\xa0number  ----------  The crawl limit for maximum depth. If zero, no limit will be applied.  Crawl DepthSet Example* cache\\xa0boolean  ----------  Use HTTP caching for the crawl to speed up repeated runs.  Set Example* budget\\xa0object  ----------  Object that has paths with a counter for limiting the amount of pages example `{\"*\":1}` for only crawling the root page. 
The wildcard matches all routes and you can set child paths preventing a depth level, example of limiting `{ \"/docs/colors\": 10, \"/docs/\": 100 }` which only allows a max of 100 pages if the route matches `/docs/:pathname` and only 10 pages if it matches `/docs/colors/:pathname`.  Crawl Budget  Set Example* locale\\xa0string  ----------  The locale to use for request, example `en-US`.  Set Example* cookies\\xa0string  ----------  Add HTTP cookies to use for request.  Set Example* stealth\\xa0boolean  ----------  Use stealth mode for headless chrome request to help prevent being blocked. The default is enabled on chrome.  Set Example* headers\\xa0string  ----------  Forward HTTP headers to use for all request. The object is expected to be a map of key value pairs.  Set Example* metadata\\xa0boolean  ----------  Boolean to store metadata about the pages and content found. This could help improve AI interopt. Defaults to false unless you have the website already stored with the configuration enabled.  Set Example* viewport\\xa0object  ----------  Configure the viewport for chrome. Defaults to 800x600.  Set Example* encoding\\xa0string  ----------  The type of encoding to use like `UTF-8`, `SHIFT_JIS`, or etc.  Set Example* subdomains\\xa0boolean  ----------  Allow subdomains to be included.  Set Example* user\\\\_agent\\xa0string  ----------  Add a custom HTTP user agent to the request.  Set Example* store\\\\_data\\xa0boolean  ----------  Boolean to determine if storage should be used. If set this takes precedence over `storageless`. Defaults to false.  Set Example* gpt\\\\_config\\xa0object  ----------  Use AI to generate actions to perform during the crawl. You can pass an array for the`\"prompt\"` to chain steps.  Set Example* fingerprint\\xa0boolean  ----------  Use advanced fingerprint for chrome.  Set Example* storageless\\xa0boolean  ----------  Boolean to prevent storing any type of data for the request including storage and AI vectors embedding. 
Defaults to false unless you have the website already stored.  Set Example* readability\\xa0boolean  ----------  Use [readability](https://github.com/mozilla/readability) to pre-process the content for reading. This may drastically improve the content for LLM usage.  Set Example* return\\\\_format\\xa0string  ----------  The format to return the data in. Possible values are `markdown`, `raw`, `text`, and `html2text`. Use `raw` to return the default format of the page like `HTML` etc.  Raw* proxy\\\\_enabled\\xa0boolean  ----------  Enable high performance premium proxies for the request to prevent being blocked at the network level.  Set Example* query\\\\_selector\\xa0string  ----------  The CSS query selector to use when extracting content from the markup.  Test Query Selector* full\\\\_resources\\xa0boolean  ----------  Crawl and download all the resources for a website.  Set Example* request\\\\_timeout\\xa0number  ----------  The timeout to use for request. Timeouts can be from 5-60. The default is 30 seconds.  Set Example* run\\\\_in\\\\_background\\xa0boolean  ----------  Run the request in the background. Useful if storing data and wanting to trigger crawls to the dashboard. This has no effect if storageless is set.  
Set ExampleShow More Properties* Basic* StreamingExample requestPythonCopy```import requests, osheaders = {    \\'Authorization\\': os.environ[\"SPIDER_API_KEY\"],    \\'Content-Type\\': \\'application/json\\',}json_data = {\"limit\":50,\"url\":\"http://www.example.com\"}response = requests.post(\\'https://api.spider.cloud/pipeline/extract-contacts\\',  headers=headers,  json=json_data)print(response.json())```ResponseCopy```[  {    \"content\": [{ \"full_name\": \"John Doe\", \"email\": \"johndoe@gmail.com\", \"phone\": \"555-555-555\", \"title\": \"Baker\"}, ...],    \"error\": null,    \"status\": 200,    \"url\": \"http://www.example.com\"  },  // more content...]```Label website==========Crawl a website and accurately categorize it using AI.POST https://api.spider.cloud/pipeline/labelRequest bodyGeneralSpecific* url\\xa0required\\xa0string  ----------  The URI resource to crawl. This can be a comma split list for multiple urls.  Test Url* request\\xa0string  ----------  The request type to perform. Possible values are `http`, `chrome`, and `smart`. Use `smart` to perform HTTP request by default until JavaScript rendering is needed for the HTML.  HTTP* limit\\xa0number  ----------  The maximum amount of pages allowed to crawl per website. Remove the value or set it to 0 to crawl all pages.  Crawl Limit* depth\\xa0number  ----------  The crawl limit for maximum depth. If zero, no limit will be applied.  Crawl DepthSet Example* cache\\xa0boolean  ----------  Use HTTP caching for the crawl to speed up repeated runs.  Set Example* budget\\xa0object  ----------  Object that has paths with a counter for limiting the amount of pages example `{\"*\":1}` for only crawling the root page. The wildcard matches all routes and you can set child paths preventing a depth level, example of limiting `{ \"/docs/colors\": 10, \"/docs/\": 100 }` which only allows a max of 100 pages if the route matches `/docs/:pathname` and only 10 pages if it matches `/docs/colors/:pathname`.  
Crawl Budget  Set Example* locale\\xa0string  ----------  The locale to use for request, example `en-US`.  Set Example* cookies\\xa0string  ----------  Add HTTP cookies to use for request.  Set Example* stealth\\xa0boolean  ----------  Use stealth mode for headless chrome request to help prevent being blocked. The default is enabled on chrome.  Set Example* headers\\xa0string  ----------  Forward HTTP headers to use for all request. The object is expected to be a map of key value pairs.  Set Example* metadata\\xa0boolean  ----------  Boolean to store metadata about the pages and content found. This could help improve AI interopt. Defaults to false unless you have the website already stored with the configuration enabled.  Set Example* viewport\\xa0object  ----------  Configure the viewport for chrome. Defaults to 800x600.  Set Example* encoding\\xa0string  ----------  The type of encoding to use like `UTF-8`, `SHIFT_JIS`, or etc.  Set Example* subdomains\\xa0boolean  ----------  Allow subdomains to be included.  Set Example* user\\\\_agent\\xa0string  ----------  Add a custom HTTP user agent to the request.  Set Example* store\\\\_data\\xa0boolean  ----------  Boolean to determine if storage should be used. If set this takes precedence over `storageless`. Defaults to false.  Set Example* gpt\\\\_config\\xa0object  ----------  Use AI to generate actions to perform during the crawl. You can pass an array for the`\"prompt\"` to chain steps.  Set Example* fingerprint\\xa0boolean  ----------  Use advanced fingerprint for chrome.  Set Example* storageless\\xa0boolean  ----------  Boolean to prevent storing any type of data for the request including storage and AI vectors embedding. Defaults to false unless you have the website already stored.  Set Example* readability\\xa0boolean  ----------  Use [readability](https://github.com/mozilla/readability) to pre-process the content for reading. This may drastically improve the content for LLM usage.  
Set Example* return\\\\_format\\xa0string  ----------  The format to return the data in. Possible values are `markdown`, `raw`, `text`, and `html2text`. Use `raw` to return the default format of the page like `HTML` etc.  Raw* proxy\\\\_enabled\\xa0boolean  ----------  Enable high performance premium proxies for the request to prevent being blocked at the network level.  Set Example* query\\\\_selector\\xa0string  ----------  The CSS query selector to use when extracting content from the markup.  Test Query Selector* full\\\\_resources\\xa0boolean  ----------  Crawl and download all the resources for a website.  Set Example* request\\\\_timeout\\xa0number  ----------  The timeout to use for request. Timeouts can be from 5-60. The default is 30 seconds.  Set Example* run\\\\_in\\\\_background\\xa0boolean  ----------  Run the request in the background. Useful if storing data and wanting to trigger crawls to the dashboard. This has no effect if storageless is set.  Set ExampleShow More Properties* Basic* StreamingExample requestPythonCopy```import requests, osheaders = {    \\'Authorization\\': os.environ[\"SPIDER_API_KEY\"],    \\'Content-Type\\': \\'application/json\\',}json_data = {\"limit\":50,\"url\":\"http://www.example.com\"}response = requests.post(\\'https://api.spider.cloud/pipeline/label\\',  headers=headers,  json=json_data)print(response.json())```ResponseCopy```[  {    \"content\": [\"Government\"],    \"error\": null,    \"status\": 200,    \"url\": \"http://www.example.com\"  },  // more content...]```Crawl State==========Get the state of the crawl for the domain.POST https://api.spider.cloud/crawl/statusRequest body* url\\xa0required\\xa0string  ----------  The URI resource to crawl. This can be a comma split list for multiple urls.  
Test UrlShow More Properties* Basic* StreamingExample requestPythonCopy```import requests, osheaders = {    \\'Authorization\\': os.environ[\"SPIDER_API_KEY\"],    \\'Content-Type\\': \\'application/json\\',}response = requests.post(\\'https://api.spider.cloud/crawl/status\\',  headers=headers)print(response.json())```ResponseCopy```  {    \"content\": {        \"data\": {          \"id\": \"195bf2f2-2821-421d-b89c-f27e57ca71fh\",          \"user_id\": \"6bd06efa-bb0a-4f1f-a29f-05db0c4b1bfg\",          \"domain\": \"example.com\",          \"url\": \"https://example.com/\",          \"links\":1,          \"credits_used\": 3,          \"mode\":2,          \"crawl_duration\": 340,          \"message\": null,          \"request_user_agent\": \"Spider\",          \"level\": \"info\",          \"status_code\": 0,          \"created_at\": \"2024-04-21T01:21:32.886863+00:00\",          \"updated_at\": \"2024-04-21T01:21:32.886863+00:00\"        },        \"error\": \"\"      },    \"error\": null,    \"status\": 200,    \"url\": \"http://www.example.com\"  }```Credits Available==========Get the remaining credits available.GET https://api.spider.cloud/credits* Basic* StreamingExample requestPythonCopy```import requests, osheaders = {    \\'Authorization\\': os.environ[\"SPIDER_API_KEY\"],    \\'Content-Type\\': \\'application/json\\',}response = requests.post(\\'https://api.spider.cloud/credits\\',  headers=headers)print(response.json())```ResponseCopy```{ \"credits\": 52566 }```[API](/docs/api) [Pricing](/credits/new) [Guides](/guides) [About](/about) [Docs](https://docs.rs/spider/latest/spider/) [Privacy](/privacy) [Terms](/eula)© 2024 Spider from A11yWatchTheme Light Dark Toggle Theme [GitHubGithub](https://github.com/spider-rs/spider)', start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\\n\\n{content}', metadata_template='{key}: {value}', metadata_seperator='\\n'), Document(id_='44b350c3-f907-4767-84ec-a73fe59c190c', embedding=None, 
metadata={'description': 'End User License Agreement for the Spiderwebai and the spider project.', 'domain': 'spider.cloud', 'extracted_data': None, 'file_size': 20123, 'keywords': None, 'pathname': '/eula', 'resource_type': 'html', 'title': 'EULA', 'url': '48f1bc3c-3fbb-408a-865b-c191a1bb1f48/spider.cloud/eula.html', 'user_id': '48f1bc3c-3fbb-408a-865b-c191a1bb1f48'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text='EULA[Spider v1 Logo Spider ](/) [Credits](/credits/new)[GitHubGithub637](https://github.com/spider-rs/spider)End User License Agreement==========Our end user license agreement may change from time to time as we build out the software.Right to Ban----------Part of making sure the Spider is being used for the right purpose we will not allow malicious acts to be done with the system. If we find that you are using the tool to hack, crawl illegal pages, porn, or anything that falls into this line will be banned from the system. You can reach out to us to weigh out your reasons on why you should not be banned.License----------You can use the API and service to build ontop of. Replicating the features and re-selling the service is not allowed. We do not provide any custom license for the platform and encourage users to use our system to handle any crawling, scraping, or data curation needs for speed and cost effectiveness.### Adjustments to Plans ###The software is very new and while we figure out what we can charge to maintain the systems the plans may change. We will send out a notification of the changes in our [Discord](https://discord.gg/5bDPDxwTn3) or Github. For the most part plans will increase drastically with things set to scale costs that allow more usage for everyone. Spider is a product of[A11yWatch LLC](https://a11ywatch.com) the web accessibility tool. 
The crawler engine of Spider powers the curation for A11yWatch allowing auditing websites accessibility compliance extremely fast.#### Contact ####For information about how to contact Spider, please reach out to email below.[support@spider.cloud](mailto:support@spider.cloud)[API](/docs/api) [Pricing](/credits/new) [Guides](/guides) [About](/about) [Docs](https://docs.rs/spider/latest/spider/) [Privacy](/privacy) [Terms](/eula)© 2024 Spider from A11yWatchTheme Light Dark Toggle Theme [GitHubGithub](https://github.com/spider-rs/spider)', start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\\n\\n{content}', metadata_template='{key}: {value}', metadata_seperator='\\n'), Document(id_='445c0c76-bfd5-4f89-a439-fbdeb8077a4c', embedding=None, metadata={'description': 'Spider is the fastest web crawler written in Rust. The Cloud version is a hosted version of open-source project.', 'domain': 'spider.cloud', 'extracted_data': None, 'file_size': 139080, 'keywords': None, 'pathname': '/about', 'resource_type': 'html', 'title': 'About', 'url': '48f1bc3c-3fbb-408a-865b-c191a1bb1f48/spider.cloud/about.html', 'user_id': '48f1bc3c-3fbb-408a-865b-c191a1bb1f48'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text='About[Spider v1 Logo Spider ](/) [Credits](/credits/new)[GitHubGithub637](https://github.com/spider-rs/spider) About==========Spider is the fastest web crawler written in Rust. The Cloud version is a hosted version of open-source project. Spider Features----------Our features that facilitate website scraping and provide swift insights in one platform. Deliver astonishing results using our powerful API.### Fast Unblockable Scraping ###When it comes to speed, the Spider project is the fastest web crawler available to the public. 
Utilize the foundation of open-source tools and make the most of your budget to scrape content effectively.Collecting Data Logo### Gain Website Insights with AI ###Enhance your crawls with AI to obtain relevant information fast from any website.AI Search### Extract Data Using Webhooks ###Set up webhooks across your websites to deliver the desired information anywhere you need.News Logo[A11yWatch](https://a11ywatch.com)maintains the project and the hosting for the service.[API](/docs/api) [Pricing](/credits/new) [Guides](/guides) [About](/about) [Docs](https://docs.rs/spider/latest/spider/) [Privacy](/privacy) [Terms](/eula)© 2024 Spider from A11yWatchTheme Light Dark Toggle Theme [GitHubGithub](https://github.com/spider-rs/spider)', start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\\n\\n{content}', metadata_template='{key}: {value}', metadata_seperator='\\n'), Document(id_='1a2d63a5-0315-4c5b-8fed-8ac460b82cc7', embedding=None, metadata={'description': 'Add the amount of credits you want to purchase for scraping the internet with AI and LLM data curation abilities fast.', 'domain': 'spider.cloud', 'extracted_data': None, 'file_size': 23083, 'keywords': None, 'pathname': '/credits/new', 'resource_type': 'html', 'title': 'Purchase Spider Credits', 'url': '48f1bc3c-3fbb-408a-865b-c191a1bb1f48/spider.cloud/credits*_*new.html', 'user_id': '48f1bc3c-3fbb-408a-865b-c191a1bb1f48'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text='Purchase Spider Credits[Spider v1 Logo Spider ](/) [Credits](/credits/new)[GitHubGithub637](https://github.com/spider-rs/spider)Add credits==========Add credits to start crawling any website today.|Default|      Features      |       Amount       ||-------|--------------------|--------------------||Default| Scraping Websites  |$0.03 / gb bandwidth|| Extra |  Premium Proxies   |$0.01 / gb bandwidth|| Extra |Javascript Rendering|$0.01 / gb bandwidth|| Extra |    Data Storage    |  $0.30 / gb 
month  || Extra |      AI Chat       | $0.01 input/output |[API](/docs/api) [Pricing](/credits/new) [Guides](/guides) [About](/about) [Docs](https://docs.rs/spider/latest/spider/) [Privacy](/privacy) [Terms](/eula)© 2024 Spider from A11yWatchTheme Light Dark Toggle Theme [GitHubGithub](https://github.com/spider-rs/spider)', start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\\n\\n{content}', metadata_template='{key}: {value}', metadata_seperator='\\n'), Document(id_='6701b47a-0000-4111-8b5b-c77b01937a7d', embedding=None, metadata={'description': 'Collect data rapidly from any website. Seamlessly scrape websites and get data tailored for LLM workloads.', 'domain': 'spider.cloud', 'extracted_data': None, 'file_size': 101750, 'keywords': None, 'pathname': '/', 'resource_type': 'html', 'title': 'Spider - Fastest Web Crawler', 'url': '48f1bc3c-3fbb-408a-865b-c191a1bb1f48/spider.cloud/index.html', 'user_id': '48f1bc3c-3fbb-408a-865b-c191a1bb1f48'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text='Spider - Fastest Web Crawler[Spider v1 Logo Spider ](/)[Pricing](/credits/new)[GitHubGithub637](https://github.com/spider-rs/spider)The World\\'s Fastest and Cheapest Crawler API==========View Demo* Basic* StreamingExample requestPythonCopy```import requests, osheaders = {    \\'Authorization\\': os.environ[\"SPIDER_API_KEY\"],    \\'Content-Type\\': \\'application/json\\',}json_data = {\"limit\":50,\"url\":\"http://www.example.com\"}response = requests.post(\\'https://api.spider.cloud/crawl\\',  headers=headers,  json=json_data)print(response.json())```Example ResponseUnmatched Speed----------### 5secs  ###To crawl 200 pages### 21x  ###Faster than FireCrawl### 150x  ###Faster than Apify Benchmarks displaying performance between Spider Cloud, Firecrawl, and Apify.[See framework benchmarks ](https://github.com/spider-rs/spider/blob/main/benches/BENCHMARKS.md)Foundations for Crawling Effectively----------### Leading in 
performance ###Spider is written in Rust and runs in full concurrency to achieve crawling dozens of pages in secs.### Optimal response format ###Get clean and formatted markdown, HTML, or text content for fine-tuning or training AI models.### Caching ###Further boost speed by caching repeated web page crawls.### Smart Mode ###Spider dynamically switches to Headless Chrome when it needs to.Beta### Scrape with AI ###Do custom browser scripting and data extraction using the latest AI models.### Best crawler for LLMs ###Don\\'t let crawling and scraping be the highest latency in your LLM & AI agent stack.### Scrape with no headaches ###* Proxy rotations* Agent headers* Avoid anti-bot detections* Headless chrome* Markdown LLM Responses### The Fastest Web Crawler ###* Powered by [spider-rs](https://github.com/spider-rs/spider)* Do 20,000 pages in seconds* Full concurrency* Powerful and simple API* 5,000 requests per minute### Do more with AI ###* Custom browser scripting* Advanced data extraction* Data pipelines* Perfect for LLM and AI Agents* Accurate website labeling[API](/docs/api) [Pricing](/credits/new) [Guides](/guides) [About](/about) [Docs](https://docs.rs/spider/latest/spider/) [Privacy](/privacy) [Terms](/eula)© 2024 Spider from A11yWatchTheme Light Dark Toggle Theme [GitHubGithub](https://github.com/spider-rs/spider)', start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\\n\\n{content}', metadata_template='{key}: {value}', metadata_seperator='\\n'), Document(id_='91b98a80-7112-4837-8389-cb78221b254c', embedding=None, metadata={'description': 'Get contact information from any website in real time with AI. 
The only way to accurately get dynamic information from websites.', 'domain': 'spider.cloud', 'extracted_data': None, 'file_size': 25891, 'keywords': None, 'pathname': '/guides/pipelines-extract-contacts', 'resource_type': 'html', 'title': 'Guides - Extract Contacts', 'url': '48f1bc3c-3fbb-408a-865b-c191a1bb1f48/spider.cloud/guides*_*pipelines-extract-contacts.html', 'user_id': '48f1bc3c-3fbb-408a-865b-c191a1bb1f48'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text='Guides - Extract Contacts[Spider v1 Logo Spider ](/) [Credits](/credits/new)[GitHubGithub637](https://github.com/spider-rs/spider)Extract Contacts==========Contents----------* [Seamless extracting any contact any website](#seamless-extracting-any-contact-any-website)* [UI (Extracting Contacts)](#ui-extracting-contacts)* [API Extracting Usage](#api-extracting-usage)  * [API Extracting Example](#api-extracting-example)  * [Pipelines Combo](#pipelines-combo)Seamless extracting any contact any website----------Extracting contacts from a website used to be a very difficult challenge involving many steps that would change often. The challenges typically faced involve being able to get the data from a website without being blocked and setting up query selectors for the information you need using javascript. This would often break in two folds - the data extracting with a correct stealth technique or the css selector breaking as they update the website HTML code. Now we toss those two hard challenges away - one of them spider takes care of and the other the advancement in AI to process and extract information.UI (Extracting Contacts)----------You can use the UI on the dashboard to extract contacts after you crawled a page. Go to the page youwant to extract and click on the horizontal dropdown menu to display an option to extract the contact.The crawl will get the data first to see if anything new has changed. 
Afterwards if a contact was found usually within 10-60 seconds you will get a notification that the extraction is complete with the data.![Extracting contacts with the spider app](/img/app/extract-contacts.png)After extraction if the page has contact related data you can view it with a grid in the app.![The menu displaying the found contacts after extracting with the spider app](/img/app/extract-contacts-found.png)The grid will display the name, email, phone, title, and host(website found) of the contact(s).![Grid display of all the contact information found for the web page](/img/app/extract-contacts-grid.png)API Extracting Usage----------The endpoint `/pipeline/extract-contacts` provides the ability to extract all contacts from a website concurrently.### API Extracting Example ###To extract contacts from a website you can follow the example below. All params are optional except `url`. Use the `prompt` param to adjust the way the AI handles the extracting. If you use the param `store_data` or if the website already exist in the dashboard the contact data will be saved with the page.```import requests, os, jsonheaders = {    \\'Authorization\\': os.environ[\"SPIDER_API_KEY\"],    \\'Content-Type\\': \\'application/json\\',}json_data = {\"limit\":1,\"url\":\"http://www.example.com/contacts\", \"model\": \"gpt-4-1106-preview\", \"prompt\": \"A custom prompt to tailor the extracting.\"}response = requests.post(\\'https://api.spider.cloud/crawl/pipeline/extract-contacts\\',  headers=headers,  json=json_data,  stream=True)for line in response.iter_lines():  if line:      print(json.loads(line))```### Pipelines Combo ###Piplines bring a whole new entry to workflows for data curation, if you combine the API endpoints to only use the extraction on pages you know may have contacts can save credits on the system. One way would be to perform gathering all the links first with the `/links` endpoint. 
After getting the links for the pages use `/pipeline/filter-links` with a custom prompt that can use AI to reduce the noise of the links to process before `/pipline/extract-contacts`.Loading graph...Written on:  2/1/2024[API](/docs/api) [Pricing](/credits/new) [Guides](/guides) [About](/about) [Docs](https://docs.rs/spider/latest/spider/) [Privacy](/privacy) [Terms](/eula)© 2024 Spider from A11yWatchTheme Light Dark Toggle Theme [GitHubGithub](https://github.com/spider-rs/spider)', start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\\n\\n{content}', metadata_template='{key}: {value}', metadata_seperator='\\n'), Document(id_='5e7ade0d-0a50-46de-8116-72ee5dca0b20', embedding=None, metadata={'description': 'How to use the Spider API to curate data from any source blazing fast. The most advanced crawler that handles all workloads of all sizes.', 'domain': 'spider.cloud', 'extracted_data': None, 'file_size': 24752, 'keywords': None, 'pathname': '/guides/spider-api', 'resource_type': 'html', 'title': 'Guides - Spider API', 'url': '48f1bc3c-3fbb-408a-865b-c191a1bb1f48/spider.cloud/guides*_*spider-api.html', 'user_id': '48f1bc3c-3fbb-408a-865b-c191a1bb1f48'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text='Guides - Spider API[Spider v1 Logo Spider ](/) [Credits](/credits/new)[GitHubGithub637](https://github.com/spider-rs/spider)Getting started Spider API==========Contents----------* [API built to scale](#api-built-to-scale)* [API Usage](#api-usage)* [Crawling One Page](#crawling-one-page)* [Crawling Multiple Pages](#crawling-multiple-pages)  * [Planet Scale Crawling](#planet-scale-crawling)    * [Automatic Configuration](#automatic-configuration)API built to scale----------Welcome to our cutting-edge web crawler SaaS, renowned for its unparalleled speed.Our platform is designed to effortlessly manage thousands of requests per second, thanks to our elastically scalable system architecture and the Open-Source 
[spider](https://github.com/spider-rs/spider) project. We deliver consistent latency times ensuring swift processing for all responses.For an in-depth understanding of the request parameters supported, we invite you to explore our comprehensive API documentation. At present, we do not provide client-side libraries, as our API has been crafted with simplicity in mind for straightforward usage. However, we are open to expanding our offerings in the future to enhance user convenience.Dive into our [documentation]((/docs/api)) to get started and unleash the full potential of our web crawler today.API Usage----------Getting started with the API is simple and straight forward. After you get your [secret key](/api-keys)you can access our instance directly. We have one main endpoint `/crawl` that handles all things relatedto data curation. The crawler is highly configurable through the params to fit all needs.Crawling One Page----------Most cases you probally just want to crawl one page. Even if you only need one page, our system performs fast enough to lead the race.The most straight forward way to make sure you only crawl a single page is to set the [budget limit](./account/settings) with a wild card value or `*` to 1.You can also pass in the param `limit` in the JSON body with the limit of pages.Crawling Multiple Pages----------When you crawl multiple pages, the concurrency horsepower of the spider kicks in. You might wonder why and how one request may take (x)ms to come back, and 100 requests take about the same time! That’s because the built-in isolated concurrency allows for crawling thousands to millions of pages in no time. It’s the only current solution that can handle large websites with over 100k pages within a minute or two (sometimes even in a blink or two). 
By default, we do not add any limits to crawls unless specified.### Planet Scale Crawling ###If you plan on processing crawls that have over 200 pages, we recommend streaming the request from the client instead of parsing the entire payload once finished. We have an example of this with Python on the API docs page, also shown below.```import requests, os, jsonheaders = {    \\'Authorization\\': os.environ[\"SPIDER_API_KEY\"],    \\'Content-Type\\': \\'application/json\\',}json_data = {\"limit\":250,\"url\":\"http://www.example.com\"}response = requests.post(\\'https://api.spider.cloud/crawl/crawl\\',  headers=headers,  json=json_data,  stream=True)for line in response.iter_lines():  if line:      print(json.loads(line))```#### Automatic Configuration ####Spider handles automatic concurrency handling and ip rotation to make it simple to curate data.The more credits you have or usage available allows for a higher concurrency limit.Written on:  1/3/2024[API](/docs/api) [Pricing](/credits/new) [Guides](/guides) [About](/about) [Docs](https://docs.rs/spider/latest/spider/) [Privacy](/privacy) [Terms](/eula)© 2024 Spider from A11yWatchTheme Light Dark Toggle Theme [GitHubGithub](https://github.com/spider-rs/spider)', start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\\n\\n{content}', metadata_template='{key}: {value}', metadata_seperator='\\n'), Document(id_='08e5f1d6-4ae7-4b68-ab96-4b6a3768e88c', embedding=None, metadata={'description': 'The programmable time machine that can store pages and all assets for easy website archiving.', 'domain': 'spider.cloud', 'extracted_data': None, 'file_size': 18970, 'keywords': None, 'pathname': '/guides/website-archiving', 'resource_type': 'html', 'title': 'Guides - Website Archiving', 'url': '48f1bc3c-3fbb-408a-865b-c191a1bb1f48/spider.cloud/guides*_*website-archiving.html', 'user_id': '48f1bc3c-3fbb-408a-865b-c191a1bb1f48'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, 
text='Guides - Website Archiving[Spider v1 Logo Spider ](/) [Credits](/credits/new)[GitHubGithub637](https://github.com/spider-rs/spider)Website Archiving==========With Spider you can easily backup or capture a website at any point in time.Enable Full Resource storing in the settings or website configuration to get a 1:1 copy of any websitelocally.Time Machine----------Time machine is storing data at a certain point of a time. Spider brings this to you with one simple configuration.After running the crawls you can simply download the data. This can help store assets incase the code is lost orversion control is removed.Written on:  2/7/2024[API](/docs/api) [Pricing](/credits/new) [Guides](/guides) [About](/about) [Docs](https://docs.rs/spider/latest/spider/) [Privacy](/privacy) [Terms](/eula)© 2024 Spider from A11yWatchTheme Light Dark Toggle Theme [GitHubGithub](https://github.com/spider-rs/spider)', start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\\n\\n{content}', metadata_template='{key}: {value}', metadata_seperator='\\n'), Document(id_='024cb27e-21d2-49a5-8a1a-963e72038421', embedding=None, metadata={'description': 'How to use the platform to collect data from the internet fast, affordable, and unblockable.', 'domain': 'spider.cloud', 'extracted_data': None, 'file_size': 24666, 'keywords': None, 'pathname': '/guides/spider', 'resource_type': 'html', 'title': 'Guides - Spider Platform', 'url': '48f1bc3c-3fbb-408a-865b-c191a1bb1f48/spider.cloud/guides*_*spider.html', 'user_id': '48f1bc3c-3fbb-408a-865b-c191a1bb1f48'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text='Guides - Spider Platform[Spider v1 Logo Spider ](/) [Credits](/credits/new)[GitHubGithub637](https://github.com/spider-rs/spider)Getting started collecting data with Spider==========Contents----------* [Data Curation](#data-curation)  * [Crawling (Website)](#crawling-website)  * [Crawling (API)](#crawling-api)* [Crawl 
Configuration](#crawl-configuration)  * [Proxies](#proxies)  * [Headless Browser](#headless-browser)  * [Crawl Budget Limits](#crawl-budget-limits)* [Crawling and Scraping Websites](#crawling-and-scraping-websites)  * [Transforming Data](#transforming-data)    * [Leveraging Open Source](#leveraging-open-source)* [Subscription and Spider Credits](#subscription-and-spider-credits)Data Curation----------Collecting data with Spider can be fast and rewarding if done with some simple preliminary steps.Use the dashboard to collect data seamlessly across the internet with scheduled updates.You have two main ways of collecting data using Spider. The first and simplest is to use the UI available for scraping.The alternative is to use the API to programmatically access the system and perform actions.### Crawling (Website) ###1. Register or login to your account using email or Github.2. Purchase [credits](/credits/new) to kickstart crawls with `pay-as-you-go` go after credits deplete.3. Configure crawl [settings](/account/settings) to fit workflows that you need.4. Navigate to the [dashboard](/) and enter a website url or ask a question to get a url that should be crawled.5. Crawl the website and export/download the data as needed.### Crawling (API) ###1. Register or login to your account using email or Github.2. Purchase [credits](/credits/new) to kickstart crawls with `pay-as-you-go` after credits deplete.3. Configure crawl [settings](/account/settings) to fit workflows that you need.4. Navigate to [API keys](/api-keys) and create a new secret key.5. Go to the [API docs](/docs/api) page to see how the API works and perform crawls with code examples.Crawl Configuration----------Configuration your account for how you would like to crawl can help save costs or effectiveness of the content. Some of the configurations include setting Premium Proxies, Headless Browser Rendering, Webhooks, and Budgeting.### Proxies ###Using proxies with our system is straight forward. 
Simple check the toggle on if you want all request to use a proxy to increase the success of not being blocked.![Proxies example app screenshot.](/img/app/proxy-setting.png)### Headless Browser ###If you want pages that require JavaScript to be executed the headless browser config is for you. Enabling will run all request through a real Chrome Browser for JavaScript required rendering pages.![Headless browser example app screenshot.](/img/app/headless-browser.png)### Crawl Budget Limits ###One of the key things you may need to do before getting into the crawl is setting up crawl-budgets.Crawl budgets allows you to determine how many pages you are going to crawl for a website.Determining the budget will save you costs when dealing with large websites that you only want certain data points from. The example below shows adding a asterisk (\\\\*) to determine all routes with a limit of 50 pages maximum. The settings can be overwritten by the website configuration or parameters if using the API.![Crawl budget example screenshot](/img/app/edit-budget.png)Crawling and Scraping Websites----------Collecting data can be done in many ways and for many reasons. Leveraging our state-of-the-art technology allows you to create fast workloads that can process content from multiple locations. At the time of writing, we have started to focus on our data processing API instead of the dashboard. The API has much more flexibility than the UI for performing advanced workloads like batching, formatting, and so on.![Dashboard UI for Spider displaying data collecting from www.napster.com, jeffmendez.com, rsseau.rs, and www.drake.com](/img/app/ui-crawl.png)### Transforming Data ###The API has more features for gathering the content in different formats and transforming the HTML as needed. You can transform the content from HTML to Markdown and feed it to a LLM for better handling the learning aspect. The API is the first class citizen for the application. 
The UI will have the features provided by the API eventually as the need arises.#### Leveraging Open Source ####One of the reasons Spider is the ultimate data-curation service for scraping is from the power of Open-Source. The core of the engine is completly available on [Github](https://github.com/spider-rs/spider) under [MIT](https://opensource.org/license/mit/) to show what is in store. We are constantly working on the crawler features including performance with plans to maintain the project for the long run.Subscription and Spider Credits----------The platform allows purchasing credits that gives you the ability to crawl at any time.When you purchase credits a crawl subscription is created that allows you to continue to usethe platform when your credits deplete. The limits provided coralate with the amount of creditspurchased, an example would be if you bought $5 in credits you would have about $40 in spending limit - $10 in credit gives $80 and so on.The highest purchase of credits directly determines how much is allowed on the platform. You can view your usage and credits on the [usage limits page](/account/usage).Written on:  1/2/2024[API](/docs/api) [Pricing](/credits/new) [Guides](/guides) [About](/about) [Docs](https://docs.rs/spider/latest/spider/) [Privacy](/privacy) [Terms](/eula)© 2024 Spider from A11yWatchTheme Light Dark Toggle Theme [GitHubGithub](https://github.com/spider-rs/spider)', start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\\n\\n{content}', metadata_template='{key}: {value}', metadata_seperator='\\n'), Document(id_='44bff527-c7f3-4346-a2f8-1454c52e1b01', embedding=None, metadata={'description': 'Generate API keys that allow access to the system programmatically anywhere. 
Full management access for your Spider API journey.', 'domain': 'spider.cloud', 'extracted_data': None, 'file_size': 28770, 'keywords': None, 'pathname': '/api-keys', 'resource_type': 'html', 'title': 'API Keys Spider', 'url': '48f1bc3c-3fbb-408a-865b-c191a1bb1f48/spider.cloud/api-keys.html', 'user_id': '48f1bc3c-3fbb-408a-865b-c191a1bb1f48'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text=\"API Keys Spider[Spider v1 Logo Spider ](/) [Credits](/credits/new)[GitHubGithub637](https://github.com/spider-rs/spider) API Keys==========Generate API keys that allow access to the system programmatically anywhere. Full management access for your Spider API journey. Key Management----------Your secret API keys are listed below. Please note that we do not display your secret API keys again after you generate them.Do not share your API key with others, or expose it in the browser or other client-side code. In order to protect the security of your account, Spider may also automatically disable any API key that we've found has leaked publicly.Filter Name...Columns|   Name    |Key|Created|Last Used|   ||-----------|---|-------|---------|---||No results.|   |       |         |   |0 of 0 row(s) selected.PreviousNext[API](/docs/api) [Pricing](/credits/new) [Guides](/guides) [About](/about) [Docs](https://docs.rs/spider/latest/spider/) [Privacy](/privacy) [Terms](/eula)© 2024 Spider from A11yWatchTheme Light Dark Toggle Theme [GitHubGithub](https://github.com/spider-rs/spider)\", start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\\n\\n{content}', metadata_template='{key}: {value}', metadata_seperator='\\n'), Document(id_='e577c57a-2376-452f-8c39-04d1e284595c', embedding=None, metadata={'description': 'Explore your usage and set limits that work with your budget.', 'domain': 'spider.cloud', 'extracted_data': None, 'file_size': 21195, 'keywords': None, 'pathname': '/account/usage', 'resource_type': 'html', 'title': 'Usage - 
Spider', 'url': '48f1bc3c-3fbb-408a-865b-c191a1bb1f48/spider.cloud/account*_*usage.html', 'user_id': '48f1bc3c-3fbb-408a-865b-c191a1bb1f48'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text=\"Usage - Spider[Spider v1 Logo Spider ](/) [Credits](/credits/new)[GitHubGithub637](https://github.com/spider-rs/spider) Usage limit==========Below you'll find a summary of usage for your account. The data may be delayed up to 5 minutes.Credits----------###  Pay as you go  ######  Approved usage limit  ### The maximum usage Spider allows for your organization each month. Ask for increase.###  Set a monthly budget  ###When your organization reaches this usage threshold each month, subsequent requests will be rejected. Data may be deleted if payments are rejected.[API](/docs/api) [Pricing](/credits/new) [Guides](/guides) [About](/about) [Docs](https://docs.rs/spider/latest/spider/) [Privacy](/privacy) [Terms](/eula)© 2024 Spider from A11yWatchTheme Light Dark Toggle Theme [GitHubGithub](https://github.com/spider-rs/spider)\", start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\\n\\n{content}', metadata_template='{key}: {value}', metadata_seperator='\\n'), Document(id_='e3eb1e3c-5080-4590-94e8-fd2ef4f6d3c6', embedding=None, metadata={'description': 'Adjust your spider settings to adjust your crawl settings.', 'domain': 'spider.cloud', 'extracted_data': None, 'file_size': 18322, 'keywords': None, 'pathname': '/account/settings', 'resource_type': 'html', 'title': 'Settings - Spider', 'url': '48f1bc3c-3fbb-408a-865b-c191a1bb1f48/spider.cloud/account*_*settings.html', 'user_id': '48f1bc3c-3fbb-408a-865b-c191a1bb1f48'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text='Settings - Spider[Spider v1 Logo Spider ](/) [Credits](/credits/new)[GitHubGithub637](https://github.com/spider-rs/spider)[API](/docs/api) [Pricing](/credits/new) [Guides](/guides) [About](/about) 
[Docs](https://docs.rs/spider/latest/spider/) [Privacy](/privacy) [Terms](/eula)© 2024 Spider from A11yWatchTheme Light Dark Toggle Theme [GitHubGithub](https://github.com/spider-rs/spider)', start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\\n\\n{content}', metadata_template='{key}: {value}', metadata_seperator='\\n')]\n"
     ]
    }
   ],
   "source": [
    "# Crawl a domain, following links to its subpages\n",
    "from llama_index.readers.web import SpiderWebReader\n",
    "\n",
    "spider_reader = SpiderWebReader(\n",
    "    api_key=\"YOUR_API_KEY\",\n",
    "    mode=\"crawl\",\n",
    "    # params={},  # Optional parameters; see https://spider.cloud/docs/api\n",
    ")\n",
    "\n",
    "documents = spider_reader.load_data(url=\"https://spider.cloud\")\n",
    "print(documents)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "36f671f6",
   "metadata": {},
   "source": [
    "For guides and documentation, visit [Spider](https://spider.cloud/docs/api)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "005d14cd",
   "metadata": {},
   "source": [
    "# Using Browserbase Reader 🅱️\n",
    "\n",
    "[Browserbase](https://browserbase.com) is a serverless platform for running headless browsers. It offers advanced debugging, session recordings, stealth mode, integrated proxies, and captcha solving.\n",
    "\n",
    "## Installation and Setup\n",
    "\n",
    "- Get an API key and Project ID from [browserbase.com](https://browserbase.com) and set them as environment variables (`BROWSERBASE_API_KEY`, `BROWSERBASE_PROJECT_ID`).\n",
    "- Install the [Browserbase SDK](http://github.com/browserbase/python-sdk):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c74e6425",
   "metadata": {},
   "outputs": [],
   "source": [
    "%pip install browserbase"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c23d02bc",
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index.readers.web import BrowserbaseWebReader"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7e71d347",
   "metadata": {},
   "outputs": [],
   "source": [
    "reader = BrowserbaseWebReader()\n",
    "docs = reader.load_data(\n",
    "    urls=[\n",
    "        \"https://example.com\",\n",
    "    ],\n",
    "    # Set to True to retrieve only the text content of the page\n",
    "    text_content=False,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "15f46387",
   "metadata": {},
   "source": [
    "### Using FireCrawl Reader 🔥\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "dd82bd7c",
   "metadata": {},
   "source": [
    "Firecrawl is an API that turns entire websites into clean, LLM-accessible markdown."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "45f8ac3f",
   "metadata": {},
   "source": [
    "Using Firecrawl to gather an entire website"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a41579cc",
   "metadata": {},
   "outputs": [],
   "source": [
    "%pip install firecrawl-py"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0f8b884f",
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index.readers.web import FireCrawlWebReader"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b6f8dd98",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Use Firecrawl to crawl an entire website\n",
    "firecrawl_reader = FireCrawlWebReader(\n",
    "    api_key=\"<your_api_key>\",  # Replace with your actual API key from https://www.firecrawl.dev/\n",
    "    mode=\"crawl\",  # \"crawl\" follows links across the site; \"scrape\" fetches a single page\n",
    "    params={\"additional\": \"parameters\"},  # Placeholder: replace with real Firecrawl parameters or remove\n",
    ")\n",
    "\n",
    "# Load documents from the site, starting at the given URL\n",
    "documents = firecrawl_reader.load_data(url=\"http://paulgraham.com/\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7b97adc6",
   "metadata": {},
   "outputs": [],
   "source": [
    "index = SummaryIndex.from_documents(documents)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8f867baa",
   "metadata": {},
   "outputs": [],
   "source": [
    "# set Logging to DEBUG for more detailed outputs\n",
    "query_engine = index.as_query_engine()\n",
    "response = query_engine.query(\"What did the author do growing up?\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7fda42e8",
   "metadata": {},
   "outputs": [],
   "source": [
    "display(Markdown(f\"<b>{response}</b>\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b11b2d94",
   "metadata": {},
   "source": [
    "Using Firecrawl to scrape a single page\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "870e74da",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Initialize the FireCrawlWebReader with your API key and desired mode\n",
    "from llama_index.readers.web.firecrawl_web.base import FireCrawlWebReader\n",
    "\n",
    "firecrawl_reader = FireCrawlWebReader(\n",
    "    api_key=\"<your_api_key>\",  # Replace with your actual API key from https://www.firecrawl.dev/\n",
    "    mode=\"scrape\",  # \"scrape\" fetches a single page; use \"crawl\" to follow links across a site\n",
    "    params={\"additional\": \"parameters\"},  # Placeholder: replace with real Firecrawl parameters or remove\n",
    ")\n",
    "\n",
    "# Load documents from a single page URL\n",
    "documents = firecrawl_reader.load_data(url=\"http://paulgraham.com/worked.html\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ce0cbeb5",
   "metadata": {},
   "outputs": [],
   "source": [
    "index = SummaryIndex.from_documents(documents)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "955dce83",
   "metadata": {},
   "outputs": [],
   "source": [
    "# set Logging to DEBUG for more detailed outputs\n",
    "query_engine = index.as_query_engine()\n",
    "response = query_engine.query(\"What did the author do growing up?\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a0336385",
   "metadata": {},
   "outputs": [],
   "source": [
    "display(Markdown(f\"<b>{response}</b>\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a57351a5",
   "metadata": {},
   "source": [
    "Using FireCrawl's extract mode to extract structured data from URLs"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "008a7724",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Initialize the FireCrawlWebReader with your API key and extract mode\n",
    "from llama_index.readers.web.firecrawl_web.base import FireCrawlWebReader\n",
    "\n",
    "firecrawl_reader = FireCrawlWebReader(\n",
    "    api_key=\"<your_api_key>\",  # Replace with your actual API key from https://www.firecrawl.dev/\n",
    "    mode=\"extract\",  # Use extract mode to extract structured data\n",
    "    params={\n",
    "        \"prompt\": \"Extract the title, author, and main points from this essay\",\n",
    "        # Required prompt parameter for extract mode\n",
    "    },\n",
    ")\n",
    "\n",
    "# Load documents by providing a list of URLs to extract data from\n",
    "documents = firecrawl_reader.load_data(\n",
    "    urls=[\n",
    "        \"https://www.paulgraham.com\",\n",
    "        \"https://www.paulgraham.com/worked.html\",\n",
    "    ]\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "693592bb",
   "metadata": {},
   "outputs": [],
   "source": [
    "index = SummaryIndex.from_documents(documents)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "50a5292e",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Query the extracted structured data\n",
    "query_engine = index.as_query_engine()\n",
    "response = query_engine.query(\"What are the main points from these essays?\")\n",
    "\n",
    "display(Markdown(f\"<b>{response}</b>\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e73ad2c0",
   "metadata": {},
   "source": [
    "# Using Hyperbrowser Reader ⚡\n",
    "\n",
    "[Hyperbrowser](https://hyperbrowser.ai) is a platform for running and scaling headless browsers. It lets you launch and manage browser sessions at scale and provides easy-to-use solutions for web scraping needs, such as scraping a single page or crawling an entire site.\n",
    "\n",
    "Key Features:\n",
    "- Instant Scalability - Spin up hundreds of browser sessions in seconds without infrastructure headaches\n",
    "- Simple Integration - Works seamlessly with popular tools like Puppeteer and Playwright\n",
    "- Powerful APIs - Easy-to-use APIs for scraping or crawling any site, and much more\n",
    "- Bypass Anti-Bot Measures - Built-in stealth mode, ad blocking, automatic CAPTCHA solving, and rotating proxies\n",
    "\n",
    "For more information, visit the [Hyperbrowser website](https://hyperbrowser.ai) or the [Hyperbrowser docs](https://docs.hyperbrowser.ai)."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "81e65326",
   "metadata": {},
   "source": [
    "## Installation and Setup\n",
    "\n",
    "- Head to [Hyperbrowser](https://app.hyperbrowser.ai/) to sign up and generate an API key. Then either set the `HYPERBROWSER_API_KEY` environment variable or pass the key to the `HyperbrowserWebReader` constructor.\n",
    "- Install the [Hyperbrowser SDK](https://github.com/hyperbrowserai/python-sdk):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b5e9d55a",
   "metadata": {},
   "outputs": [],
   "source": [
    "%pip install hyperbrowser"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "13b951f1",
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index.readers.web import HyperbrowserWebReader\n",
    "\n",
    "reader = HyperbrowserWebReader(api_key=\"your_api_key_here\")\n",
    "docs = reader.load_data(\n",
    "    urls=[\"https://example.com\"],\n",
    "    operation=\"scrape\",\n",
    ")\n",
    "docs"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2708dc99-0e4d-4c7e-b180-8392286d87c2",
   "metadata": {},
   "source": [
    "#### Using TrafilaturaWebReader"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "aa2d54c6-c694-4852-a743-165e4777bd56",
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index.readers.web import TrafilaturaWebReader"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "46854f2f-426e-40a3-a87f-5fb51f90e14c",
   "metadata": {},
   "outputs": [],
   "source": [
    "documents = TrafilaturaWebReader().load_data(\n",
    "    [\"http://paulgraham.com/worked.html\"]\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "80752ad3-1ed8-4695-9247-22efbe475746",
   "metadata": {},
   "outputs": [],
   "source": [
    "index = SummaryIndex.from_documents(documents)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8cc9b154-1dcf-479b-b49b-251874aea506",
   "metadata": {},
   "outputs": [],
   "source": [
    "# set Logging to DEBUG for more detailed outputs\n",
    "query_engine = index.as_query_engine()\n",
    "response = query_engine.query(\"What did the author do growing up?\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "971b6415-8bcd-4d8b-a1de-9b7ada3cd392",
   "metadata": {},
   "outputs": [],
   "source": [
    "display(Markdown(f\"<b>{response}</b>\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b2b6d07c",
   "metadata": {},
   "source": [
    "### Using RssReader"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a5ad5ca8",
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index.core import SummaryIndex\n",
    "from llama_index.readers.web import RssReader\n",
    "\n",
    "documents = RssReader().load_data(\n",
    "    [\"https://rss.nytimes.com/services/xml/rss/nyt/HomePage.xml\"]\n",
    ")\n",
    "\n",
    "index = SummaryIndex.from_documents(documents)\n",
    "\n",
    "# set Logging to DEBUG for more detailed outputs\n",
    "query_engine = index.as_query_engine()\n",
    "response = query_engine.query(\"What happened in the news today?\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d012fb0e",
   "metadata": {},
   "source": [
    "## Using ScrapFly\n",
    "ScrapFly is a web scraping API with headless browser capabilities, proxies, and anti-bot bypass. It extracts web page data as LLM-accessible markdown or text. Install the ScrapFly Python SDK using pip:\n",
    "```shell\n",
    "pip install scrapfly-sdk\n",
    "```\n",
    "\n",
    "Here is a basic usage example of ScrapflyReader:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "65bbf11f",
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index.readers.web import ScrapflyReader\n",
    "\n",
    "# Initialize ScrapflyReader with your ScrapFly API key\n",
    "scrapfly_reader = ScrapflyReader(\n",
    "    api_key=\"Your ScrapFly API key\",  # Get your API key from https://www.scrapfly.io/\n",
    "    ignore_scrape_failures=True,  # Ignore unprocessable web pages and log their exceptions\n",
    ")\n",
    "\n",
    "# Load documents from URLs as markdown\n",
    "documents = scrapfly_reader.load_data(\n",
    "    urls=[\"https://web-scraping.dev/products\"]\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f328f6d2",
   "metadata": {},
   "source": [
    "ScrapflyReader also accepts a ScrapeConfig object for customizing the scrape request. See the documentation for the full list of features and their API parameters: https://scrapfly.io/docs/scrape-api/getting-started"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5198f444",
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index.readers.web import ScrapflyReader\n",
    "\n",
    "# Initialize ScrapflyReader with your ScrapFly API key\n",
    "scrapfly_reader = ScrapflyReader(\n",
    "    api_key=\"Your ScrapFly API key\",  # Get your API key from https://www.scrapfly.io/\n",
    "    ignore_scrape_failures=True,  # Ignore unprocessable web pages and log their exceptions\n",
    ")\n",
    "\n",
    "scrapfly_scrape_config = {\n",
    "    \"asp\": True,  # Bypass scraping blocking and antibot solutions, like Cloudflare\n",
    "    \"render_js\": True,  # Enable JavaScript rendering with a cloud headless browser\n",
    "    \"proxy_pool\": \"public_residential_pool\",  # Select a proxy pool (datacenter or residential)\n",
    "    \"country\": \"us\",  # Select a proxy location\n",
    "    \"auto_scroll\": True,  # Auto scroll the page\n",
    "    \"js\": \"\",  # Execute custom JavaScript code by the headless browser\n",
    "}\n",
    "\n",
    "# Load documents from URLs as markdown\n",
    "documents = scrapfly_reader.load_data(\n",
    "    urls=[\"https://web-scraping.dev/products\"],\n",
    "    scrape_config=scrapfly_scrape_config,  # Pass the scrape config\n",
    "    scrape_format=\"markdown\",  # The scrape result format, either `markdown`(default) or `text`\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f81ccdb7",
   "metadata": {},
   "source": [
    "# Using ZyteWebReader"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "aee6d871",
   "metadata": {},
   "source": [
    "ZyteWebReader lets you access the content of a web page in different modes (\"article\", \"html-text\", \"html\").\n",
    "It also lets you change settings such as browser rendering and JavaScript, since many sites require these options to return relevant content. All supported options are listed here: https://docs.zyte.com/zyte-api/usage/reference.html\n",
    "\n",
    "To install dependencies:\n",
    "```shell\n",
    "pip install zyte-api\n",
    "```\n",
    "\n",
    "To get your Zyte API key, visit: https://docs.zyte.com/zyte-api/get-started.html"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "31e1aaa5-8bfc-452f-9c72-15def22f872f",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "5871\n"
     ]
    }
   ],
   "source": [
    "from llama_index.readers.web import ZyteWebReader\n",
    "\n",
    "# Uncomment the next two lines when running in a notebook\n",
    "# import nest_asyncio\n",
    "# nest_asyncio.apply()\n",
    "\n",
    "\n",
    "# Initialize ZyteWebReader with your Zyte API key\n",
    "zyte_reader = ZyteWebReader(\n",
    "    api_key=\"your ZYTE API key here\",\n",
    "    mode=\"article\",  # or \"html-text\" or \"html\"\n",
    ")\n",
    "\n",
    "urls = [\n",
    "    \"https://www.zyte.com/blog/web-scraping-apis/\",\n",
    "    \"https://www.zyte.com/blog/system-integrators-extract-big-data/\",\n",
    "]\n",
    "\n",
    "documents = zyte_reader.load_data(\n",
    "    urls=urls,\n",
    ")\n",
    "\n",
    "print(len(documents[0].text))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c21ae76e-1b2c-480e-a58f-9f9becce15a6",
   "metadata": {},
   "source": [
    "Browser rendering and JavaScript can be enabled by passing the corresponding parameters during initialization."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f49f22bf",
   "metadata": {},
   "outputs": [],
   "source": [
    "zyte_dw_params = {\n",
    "    \"browserHtml\": True,  # Enable browser rendering\n",
    "    \"javascript\": True,  # Enable JavaScript\n",
    "}\n",
    "\n",
    "# Initialize ZyteWebReader with your Zyte API key, using the default \"article\" mode\n",
    "zyte_reader = ZyteWebReader(\n",
    "    api_key=\"your ZYTE API key here\",\n",
    "    download_kwargs=zyte_dw_params,\n",
    ")\n",
    "\n",
    "# Load documents from URLs\n",
    "documents = zyte_reader.load_data(\n",
    "    urls=urls,\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "74b5d21f-7f53-4412-8f11-bbc84d85a1b5",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "4355"
      ]
     },
     "execution_count": null,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(documents[0].text)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "133d26d7-c26d-40b2-b08f-6c838fd3a6b6",
   "metadata": {},
   "source": [
    "Set \"continue_on_failure\" to False if you'd like to stop when any request fails."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "006254a3-5af8-4a0d-8bf0-b16b9e3dce5c",
   "metadata": {},
   "outputs": [],
   "source": [
    "zyte_reader = ZyteWebReader(\n",
    "    api_key=\"your ZYTE API key here\",\n",
    "    mode=\"html-text\",\n",
    "    download_kwargs=zyte_dw_params,\n",
    "    continue_on_failure=False,\n",
    ")\n",
    "\n",
    "# Load documents from URLs\n",
    "documents = zyte_reader.load_data(\n",
    "    urls=urls,\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3bfb8e5d-7690-4a55-9052-365cbf2c9ce8",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "17488"
      ]
     },
     "execution_count": null,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(documents[0].text)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f642faae-198e-4fad-9742-c590991c8810",
   "metadata": {},
   "source": [
    "In the default mode (\"article\"), only the article text is extracted, while in \"html-text\" mode the full text of the web page is extracted, so the resulting text is significantly longer."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ccba060e",
   "metadata": {},
   "source": [
    "# Using AgentQLWebReader 🐠"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1b0c6edb",
   "metadata": {},
   "source": [
    "Use AgentQL to scrape structured data from a website."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "527d33af",
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index.readers.web import AgentQLWebReader\n",
    "from llama_index.core import VectorStoreIndex\n",
    "from IPython.display import Markdown, display"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d9850f9b",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Use AgentQL to scrape structured data from a web page\n",
    "agentql_reader = AgentQLWebReader(\n",
    "    api_key=\"YOUR_API_KEY\",  # Replace with your actual API key from https://dev.agentql.com\n",
    "    params={\n",
    "        \"is_scroll_to_bottom_enabled\": True\n",
    "    },  # Optional additional parameters\n",
    ")\n",
    "\n",
    "# Load documents from a single page URL\n",
    "document = agentql_reader.load_data(\n",
    "    url=\"https://www.ycombinator.com/companies?batch=W25\",\n",
    "    query=\"{ company[] { name location description industry_category link(a link to the company's detail on Ycombinator)} }\",\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1e97d460",
   "metadata": {},
   "outputs": [],
   "source": [
    "index = VectorStoreIndex.from_documents(document)\n",
    "query_engine = index.as_query_engine()\n",
    "response = query_engine.query(\n",
    "    \"Find companies that are working on web agents, and list their names, locations, and links\"\n",
    ")\n",
    "\n",
    "display(Markdown(f\"<b>{response}</b>\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "48979b8f9ab2fdd4",
   "metadata": {},
   "source": [
    "# Using OxylabsWebReader"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "92afafd58264c9ad",
   "metadata": {},
   "source": [
    "OxylabsWebReader allows a user to scrape any website with different parameters while bypassing most of the anti-bot tools. Check out the [Oxylabs documentation](https://developers.oxylabs.io/scraper-apis/web-scraper-api/other-websites) to get the full list of parameters.\n",
    "\n",
    "Claim free API credentials by creating an Oxylabs account [here](https://oxylabs.io/).\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "da648546f8d6aacf",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The Legend of Zelda: Ocarina of Time | Oxylabs Scraping Sandbox\n",
      "\n",
      "[![]()![logo]()](/)\n",
      "\n",
      "Game platforms:\n",
      "\n",
      "* **All**\n",
      "\n",
      "* [Nintendo platform](/products/category/nintendo)\n",
      "\n",
      "+ wii\n",
      "+ wii-u\n",
      "+ nintendo-64\n",
      "+ switch\n",
      "+ gamecube\n",
      "+ game-boy-advance\n",
      "+ 3ds\n",
      "+ ds\n",
      "\n",
      "* [Xbox platform](/products/category/xbox-platform)\n",
      "\n",
      "* **Dreamcast**\n",
      "\n",
      "* [Playstation platform](/products/category/playstation-platform)\n",
      "\n",
      "* **Pc**\n",
      "\n",
      "* **Stadia**\n",
      "\n",
      "Go Back\n",
      "\n",
      "Note!This is a sandbox website used for web scraping. Information listed in this website does not have any real meaning and should not be associated with the actual products.\n",
      "\n",
      "![The Legend of Zelda: Ocarina of Time]()\n",
      "\n",
      "The Legend of Zelda: Ocarina of Time\n",
      "------------------------------------\n",
      "\n",
      "**Developer:** Nintendo**Platform:****Type:** singleplayer\n",
      "\n",
      "As a young boy, Link is tricked by Ganondorf, the King of the Gerudo Thieves. The evil human uses Link to gain access to the Sacred Realm, where he places his tainted hands on Triforce and transforms the beautiful Hyrulean landscape into a barren wasteland. Link is determined to fix the problems he helped to create, so with the help of Rauru he travels through time gathering the powers of the Seven Sages.\n",
      "\n",
      "91,99 €\n",
      "\n",
      "In stock\n",
      "\n",
      "Add to Basket\n",
      "\n",
      "[![The_Legend_of_Zelda:_Majora's_Mask]()\n",
      "\n",
      "#### The Legend of Zelda: Majora's Mask](/products/20)\n",
      "\n",
      "Action Adventure Fantasy\n",
      "\n",
      "Thrown into a parallel world by the mischievous actions of a possessed Skull Kid, Link finds a land in grave danger. The dark power of a relic called Majora's Mask has wreaked havoc on the citizens of Termina, but their most urgent problem is a suicidal moon crashing toward the world. Link has only 72 hours to find a way to stop its descent.\n",
      "\n",
      "91,99 €\n",
      "\n",
      "Add to Basket\n",
      "\n",
      "[![Indiana_Jones_and_the_Infernal_Machine]()\n",
      "\n",
      "#### Indiana Jones and the Infernal Machine](/products/1836)\n",
      "\n",
      "Action Adventure Historic\n",
      "\n",
      "1947. The nazis have been crushed, the Cold War has begun and Soviet agents are sniffing around an ancient ruin. Grab your whip and fedora and join Indy in a globespanning race to unearth the mysterious \"Infernal Machine\". Survive the challenges of unusual beasts, half the Red Army and more (including - oh no - snakes!) . Puzzle your way through 17 chapters of an action-packed story. Travel the world to exotic locales, from the ruins of Babylon to Egyptian deserts. All the weapons you'll need, including firearms, explosives-and of course Indy's trusty whip and revolver.\n",
      "\n",
      "80,99 €\n",
      "\n",
      "Add to Basket\n"
     ]
    }
   ],
   "source": [
    "from llama_index.readers.web import OxylabsWebReader\n",
    "\n",
    "\n",
    "reader = OxylabsWebReader(\n",
    "    username=\"OXYLABS_USERNAME\", password=\"OXYLABS_PASSWORD\"\n",
    ")\n",
    "\n",
    "documents = reader.load_data(\n",
    "    [\n",
    "        \"https://sandbox.oxylabs.io/products/1\",\n",
    "        \"https://sandbox.oxylabs.io/products/2\",\n",
    "    ]\n",
    ")\n",
    "\n",
    "print(documents[0].text)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "74f7e96aa69c66c2",
   "metadata": {},
   "source": [
    "The next example passes additional parameters to select the geolocation, user agent type, JavaScript rendering, headers, and cookies."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ebf7f4550189d1b7",
   "metadata": {},
   "outputs": [],
   "source": [
    "documents = reader.load_data(\n",
    "    [\n",
    "        \"https://sandbox.oxylabs.io/products/3\",\n",
    "    ],\n",
    "    {\n",
    "        \"geo_location\": \"Berlin, Germany\",\n",
    "        \"render\": \"html\",\n",
    "        \"user_agent_type\": \"mobile\",\n",
    "        \"context\": [\n",
    "            {\"key\": \"force_headers\", \"value\": True},\n",
    "            {\"key\": \"force_cookies\", \"value\": True},\n",
    "            {\n",
    "                \"key\": \"headers\",\n",
    "                \"value\": {\n",
    "                    \"Content-Type\": \"text/html\",\n",
    "                    \"Custom-Header-Name\": \"custom header content\",\n",
    "                },\n",
    "            },\n",
    "            {\n",
    "                \"key\": \"cookies\",\n",
    "                \"value\": [\n",
    "                    {\"key\": \"NID\", \"value\": \"1234567890\"},\n",
    "                    {\"key\": \"1P JAR\", \"value\": \"0987654321\"},\n",
    "                ],\n",
    "            },\n",
    "            {\"key\": \"http_method\", \"value\": \"get\"},\n",
    "            {\"key\": \"follow_redirects\", \"value\": True},\n",
    "            {\"key\": \"successful_status_codes\", \"value\": [808, 909]},\n",
    "        ],\n",
    "    },\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8af04ee7",
   "metadata": {},
   "source": [
    "# Using ZenRows Web Reader 🌐\n",
    "\n",
    "[ZenRows](https://www.zenrows.com/) is a powerful web scraping API that provides advanced features for bypassing anti-bot measures and extracting data from modern websites.\n",
    "\n",
    "Key Features:\n",
    "- **JavaScript Rendering**: Handle SPAs and dynamic content with headless browser rendering\n",
    "- **Premium Proxies**: Bypass anti-bot protection with 55M+ residential IPs from 190+ countries\n",
    "- **Session Management**: Maintain the same IP across multiple requests\n",
    "- **Advanced Data Extraction**: Use CSS selectors or automatic parsing to extract specific data\n",
    "- **Multiple Output Formats**: Get results in HTML, Markdown, Text, or PDF format\n",
    "- **Geolocation Support**: Use proxies from specific countries for geo-restricted content\n",
    "\n",
    "**Prerequisites:** You need a ZenRows API key to use this reader. You can get one at [zenrows.com](https://app.zenrows.com/register)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "af4ed863",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Basic web scraping with ZenRows\n",
    "from llama_index.readers.web import ZenRowsWebReader\n",
    "\n",
    "zenrows_reader = ZenRowsWebReader(\n",
    "    api_key=\"YOUR_API_KEY\",  # Get one at https://app.zenrows.com/register\n",
    "    response_type=\"markdown\",\n",
    ")\n",
    "\n",
    "# Scrape a single URL\n",
    "documents = zenrows_reader.load_data([\"https://httpbin.io/html\"])\n",
    "print(documents[0].text[:500])  # Print first 500 characters"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "79f3ba0e",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Advanced scraping with anti-bot bypass\n",
    "zenrows_advanced = ZenRowsWebReader(\n",
    "    api_key=\"YOUR_API_KEY\",\n",
    "    js_render=True,  # Enable JavaScript rendering\n",
    "    premium_proxy=True,  # Use residential proxies\n",
    "    proxy_country=\"us\",  # Optional: specify country\n",
    ")\n",
    "\n",
    "documents = zenrows_advanced.load_data(\n",
    "    [\"https://www.scrapingcourse.com/antibot-challenge\"]\n",
    ")\n",
    "print(f\"Scraped {len(documents[0].text)} characters with advanced features\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "eea1cd48",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Integration with LlamaIndex - scraping multiple pages\n",
    "zenrows_reader = ZenRowsWebReader(\n",
    "    api_key=\"YOUR_API_KEY\", js_render=True, response_type=\"markdown\"\n",
    ")\n",
    "\n",
    "# Scrape multiple URLs\n",
    "urls = [\"https://example.com/\", \"https://httpbin.io/html\"]\n",
    "\n",
    "documents = zenrows_reader.load_data(urls)\n",
    "\n",
    "# Create index and query\n",
    "index = SummaryIndex.from_documents(documents)\n",
    "query_engine = index.as_query_engine()\n",
    "response = query_engine.query(\"What content was found on these pages?\")\n",
    "\n",
    "display(Markdown(f\"<b>{response}</b>\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3b4d7393",
   "metadata": {},
   "source": [
    "For more advanced features like custom headers, CSS data extraction, screenshot capabilities, and detailed configuration options, visit the [ZenRows documentation](https://docs.zenrows.com/universal-scraper-api/api-reference)."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "olostep-title",
   "metadata": {},
   "source": [
    "# Using Olostep Web Reader 🧢"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "olostep-intro",
   "metadata": {},
   "source": [
    "[Olostep](https://www.olostep.com/) is a reliable and **cost-effective web scraping API built for scale.** It bypasses bot detection, delivers results in seconds, and can process millions of requests.\n",
    "\n",
    "The API returns clean data from any website in various formats, including Markdown, HTML, and structured JSON.\n",
    "\n",
    "Sign up [here](https://www.olostep.com/auth) and get 1000 credits for free."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "olostep-scrape-code",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Olostep offers a Web Scraping API that provides clean data for AI applications from any website in just 1-5 seconds. The API can handle up to 100K requests in minutes, making it efficient and cost-effective. Users can sign up for free with an invite code and access various features like structured data extraction, parsers for common websites, and batch executions for scaling up to 100K URLs in 5-7 minutes. Olostep emphasizes reliability, scalability, and affordability, catering to startups, AI developers, and businesses needing web data extraction services. Additionally, the API supports JS execution, residential IPs, and various output formats like Markdown, HTML, PDF, and structured JSON.\n"
     ]
    }
   ],
   "source": [
    "# Scraping content in Markdown\n",
    "\n",
    "from llama_index.readers.web import OlostepWebReader\n",
    "from llama_index.core import SummaryIndex\n",
    "\n",
    "# Initialize the reader in scrape mode\n",
    "reader = OlostepWebReader(api_key=\"YOUR_OLOSTEP_API_KEY\", mode=\"scrape\")\n",
    "\n",
    "# Load data from a URL\n",
    "documents = reader.load_data(url=\"https://www.olostep.com/\")\n",
    "\n",
    "# Create index and query\n",
    "index = SummaryIndex.from_documents(documents)\n",
    "query_engine = index.as_query_engine()\n",
    "response = query_engine.query(\"Summarize in 100 words\")\n",
    "\n",
    "print(response)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "olostep-search-code",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The Latest AI News and AI Breakthroughs that Matter Most\n",
      "Advancements in AI and Machine Learning\n",
      "Top 11 New Technologies in AI: Exploring the Latest Trends\n",
      "AI News | Latest AI News, Analysis & Events\n",
      "Year in review: Google's biggest AI advancements of 2024\n",
      "Clarifying The Latest AI Advancements\n",
      "5 examples of the most advanced AI | Achieve better ROI now\n",
      "6 AI trends you'll see more of in 2025\n",
      "The future of AI: trends shaping the next 10 years\n",
      "5 AI Trends Shaping Innovation and ROI in 2025\n"
     ]
    }
   ],
   "source": [
    "# Running Google Searches\n",
    "\n",
    "from llama_index.readers.web import OlostepWebReader\n",
    "from llama_index.core import SummaryIndex\n",
    "\n",
    "# Initialize the reader in search mode\n",
    "reader = OlostepWebReader(api_key=\"YOUR_OLOSTEP_API_KEY\", mode=\"search\")\n",
    "\n",
    "# Load data using a search query\n",
    "documents = reader.load_data(query=\"What are the latest advancements in AI?\")\n",
    "\n",
    "# You can also pass additional parameters, for example, to specify the country for the search\n",
    "documents_with_params = reader.load_data(\n",
    "    query=\"What are the latest advancements in AI?\", params={\"country\": \"US\"}\n",
    ")\n",
    "\n",
    "# Create index and query\n",
    "index = SummaryIndex.from_documents(documents)\n",
    "query_engine = index.as_query_engine()\n",
    "response = query_engine.query(\"List me the headlines\")\n",
    "\n",
    "print(response)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "07117c04",
   "metadata": {},
   "source": [
    "# Using Scrapy Web Reader 🕸️"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "22fd0310",
   "metadata": {},
   "source": [
    "Scrapy is a popular web crawling framework for Python. The ScrapyWebReader allows you to leverage Scrapy's powerful crawling capabilities to extract data from websites. It can be used in two ways:\n",
    "\n",
    "1. By providing a Scrapy spider class.\n",
    "2. By providing the path to a Scrapy project."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0462b632",
   "metadata": {},
   "source": [
    "### 1. Using a Scrapy Spider Class"
   ]
  },
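  {
   "cell_type": "code",
   "execution_count": null,
   "id": "25da4f69",
   "metadata": {},
   "outputs": [],
   "source": [
    "from scrapy.spiders import Spider\n",
    "from llama_index.readers.web import ScrapyWebReader\n",
    "\n",
    "\n",
    "class SampleSpider(Spider):\n",
    "    name = \"sample_spider\"\n",
    "    start_urls = [\"http://quotes.toscrape.com\"]\n",
    "\n",
    "    def parse(self, response):\n",
    "        # Illustrative parse logic (an assumption, not from the original):\n",
    "        # yield one item per quote block on the page.\n",
    "        for quote in response.css(\"div.quote\"):\n",
    "            yield {\n",
    "                \"text\": quote.css(\"span.text::text\").get(),\n",
    "                \"author\": quote.css(\"small.author::text\").get(),\n",
    "                \"tags\": quote.css(\"div.tags a.tag::text\").getall(),\n",
    "            }\n",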
    "\n",
    "\n",
    "reader = ScrapyWebReader()\n",
    "docs = reader.load_data(SampleSpider)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e99c6e02",
   "metadata": {},
   "source": [
    "### 2. Using a Scrapy Project Path"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1110e52e",
   "metadata": {},
   "source": [
    "Download a sample Scrapy project:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "40060d02",
   "metadata": {},
   "outputs": [],
   "source": [
    "!git clone https://github.com/scrapy/quotesbot.git"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "91d304d4",
   "metadata": {},
   "source": [
    "Use the Scrapy project with the spider named \"toscrape-css\":"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8cf448df",
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index.readers.web import ScrapyWebReader\n",
    "\n",
    "reader = ScrapyWebReader(project_path=\"./quotesbot\")\n",
    "docs = reader.load_data(\"toscrape-css\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "12c85cd4",
   "metadata": {},
   "source": [
    "### Metadata"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ce6769ec",
   "metadata": {},
   "source": [
    "Some keys from the scraped items can be stored as metadata in the Document object. You can specify which keys to include as metadata using the `metadata_keys` parameter. If you want to keep the keys in both the content and as metadata, you can set the `keep_keys` parameter to `True`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1c3f6112",
   "metadata": {},
   "outputs": [],
   "source": [
    "reader = ScrapyWebReader(\n",
    "    project_path=\"./quotesbot\",\n",
    "    metadata_keys=[\"author\", \"tags\"],\n",
    "    keep_keys=True,\n",
    ")\n",
    "docs = reader.load_data(\"toscrape-css\")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "llama-index-KZjFUsTf-py3.13",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
